🚨 Refactor a bit add_tokens logic: fix bytelevel decode of added tokens + less memory deserialization 🚨#1995
Merged
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
8 tasks
ArthurZucker
commented
Mar 27, 2026
The daachorse library swap is now isolated in PR #1999. This restores aho-corasick while keeping all the algorithmic improvements: - refresh_added_tokens uses added_tokens_map_r directly (no token_to_id loop) - MatchingSet is Option<(AhoCorasick, Vec<u32>)> for empty-trie fast-path - AddedVocabulary no longer maintains redundant added_tokens/special_tokens Vecs
McPatate
reviewed
Apr 8, 2026
| @@ -254,47 +265,49 @@ impl AddedVocabulary { | |||
Member
There was a problem hiding this comment.
Suggested change
| tokens: impl IntoIterator<Item = AddedToken>, |
…ialEq - Replace .unwrap()/.expect() with ? in add_tokens, refresh_normalized_tokens, refresh_added_tokens, and with_normalizer (return Result instead of panicking) - Remove seed_normalized_cache and double iteration during deserialization; let add_tokens compute normalization directly - Use PartialEq for duplicate token comparison instead of manual field checks (also fixes missing single_word comparison) - Update all callers in tests, benchmarks, examples, and Python/Node bindings
Collaborator
Author
|
/benchmark |
Python Benchmark resultsCommit:
|
Rust Benchmark resultsCommit:
|
The normalized cache is a derived value from content + normalizer. Serializing it is wrong because the normalizer may have changed since the file was saved. The cache is always rebuilt during add_tokens() using the active normalizer.
Collaborator
Author
|
/benchmark |
ArthurZucker
commented
Apr 10, 2026
Consistent with add_tokens — consumes tokens instead of borrowing, avoiding clones in the internal forwarding call.
ArthurZucker
commented
Apr 10, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
🚨 When adding the token's content we should normalize it if
normalized=True🚨This is quite a change, but should mostly affect internal representation.
Just realized that if content is normalized, when we serialize/deserialize we'll have an issue.
Might as well just save a new
_normalizedfield that will be internal + default to content?Change the algo to simplify. Consume the added tokens especially when deserializing → no clone. This should give mild performance boost now that we have
daachorse.Add python test for ByteLevel decode with normalizer.
The issue before the PR is that you had to manually normalize the tokens you are adding.
Previously every
add_tokenscall renormalized the entire added vocab to build the regex; now it only normalizes what's new. This can be wrong if the normalizer changes — we could guard against this by updating thetokenizer.normalizersetter to refresh added tokens.Benchmark results (
added_vocab_deserialize, 100k tokens)Ran against up-to-date
main(includes daachorse, #1999) to confirm zero overhead:mainfix-byte-normNo measurable regression introduced by this PR.