Skip to content

🚨 Refactor a bit add_tokens logic: fix bytelevel decode of added tokens + less memory deserialization 🚨#1995

Merged
ArthurZucker merged 51 commits into
mainfrom
fix-byte-norm
Apr 10, 2026
Merged

🚨 Refactor a bit add_tokens logic: fix bytelevel decode of added tokens + less memory deserialization 🚨#1995
ArthurZucker merged 51 commits into
mainfrom
fix-byte-norm

Conversation

@ArthurZucker
Copy link
Copy Markdown
Collaborator

@ArthurZucker ArthurZucker commented Mar 27, 2026

Summary

  1. 🚨 When adding the token's content we should normalize it if normalized=True 🚨
    This is quite a change, but should mostly affect internal representation.
    Just realized that if content is normalized, when we serialize/deserialize we'll have an issue.
    Might as well just save a new _normalized field that will be internal + default to content?

  2. Change the algo to simplify. Consume the added tokens especially when deserializing → no clone. This should give mild performance boost now that we have daachorse.

  3. Add python test for ByteLevel decode with normalizer.
    The issue before the PR is that you had to manually normalize the tokens you are adding.

  4. Previously every add_tokens call renormalized the entire added vocab to build the regex; now it only normalizes what's new. This can be wrong if the normalizer changes — we could guard against this by updating the tokenizer.normalizer setter to refresh added tokens.

Benchmark results (added_vocab_deserialize, 100k tokens)

Ran against up-to-date main (includes daachorse, #1999) to confirm zero overhead:

Variant main fix-byte-norm
special tokens — no normalizer 232 ms 222 ms
non-special tokens — no normalizer 258 ms 251 ms
special tokens — nfkc 226 ms 221 ms
non-special tokens — nfkc 263 ms 260 ms

No measurable regression introduced by this PR.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested a review from McPatate March 27, 2026 21:28
@ArthurZucker ArthurZucker changed the title Draft update of added token refreshing which is a bottleneck Fix bytelevel decode of added tokens Mar 27, 2026
Comment thread bindings/python/tests/bindings/test_tokenizer.py Outdated
@ArthurZucker ArthurZucker changed the title Fix bytelevel decode of added tokens Fix bytelevel decode of added tokens + 27x faster deserialization Mar 27, 2026
@ArthurZucker ArthurZucker changed the title Fix bytelevel decode of added tokens + 27x faster deserialization Refactor a bit add_tokens logic: fix bytelevel decode of added tokens + faster deserialization Mar 30, 2026
The daachorse library swap is now isolated in PR #1999. This restores
aho-corasick while keeping all the algorithmic improvements:
- refresh_added_tokens uses added_tokens_map_r directly (no token_to_id loop)
- MatchingSet is Option<(AhoCorasick, Vec<u32>)> for empty-trie fast-path
- AddedVocabulary no longer maintains redundant added_tokens/special_tokens Vecs
Copy link
Copy Markdown
Member

@McPatate McPatate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff!

Comment thread tokenizers/examples/profile_added_vocab_deserialize.rs Outdated
Comment thread tokenizers/Cargo.toml Outdated
Comment thread tokenizers/src/tokenizer/added_vocabulary.rs
@@ -254,47 +265,49 @@ impl AddedVocabulary {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tokens: impl IntoIterator<Item = AddedToken>,

Comment thread tokenizers/src/tokenizer/added_vocabulary.rs Outdated
Comment thread tokenizers/src/tokenizer/added_vocabulary.rs Outdated
Comment thread tokenizers/src/tokenizer/added_vocabulary.rs Outdated
Comment thread tokenizers/src/tokenizer/added_vocabulary.rs Outdated
Comment thread tokenizers/src/tokenizer/added_vocabulary.rs Outdated
ArthurZucker and others added 12 commits April 9, 2026 09:27
…ialEq

- Replace .unwrap()/.expect() with ? in add_tokens, refresh_normalized_tokens,
  refresh_added_tokens, and with_normalizer (return Result instead of panicking)
- Remove seed_normalized_cache and double iteration during deserialization;
  let add_tokens compute normalization directly
- Use PartialEq for duplicate token comparison instead of manual field checks
  (also fixes missing single_word comparison)
- Update all callers in tests, benchmarks, examples, and Python/Node bindings
@huggingface huggingface deleted a comment from github-actions Bot Apr 10, 2026
@ArthurZucker
Copy link
Copy Markdown
Collaborator Author

/benchmark

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 10, 2026

Python Benchmark results

Commit: ce0a769cb4b9baea8ee77420637cac5656e7d7cd

Benchmark Baseline (ms) This run (ms) Δ
test_async_encode_batch 1305.2 1296.1 -0.7%
test_async_encode_batch_fast 1054.6 1031.7 -2.2%
test_decode_batch 2.4 2.4 +2.3%
test_encode 2545.9 2694.3 +5.8%
test_encode_batch 1301.0 1303.5 +0.2%
test_encode_batch_multithreaded 1289.6 1276.2 -1.0%
test_encode_fast 1043.3 1033.1 -1.0%
test_from_file_albert 45.4 47.1 +3.8%
test_from_file_llama3 408.7 410.3 +0.4%
test_from_file_roberta 76.1 76.2 +0.2%
test_from_str_llama3 389.0 383.4 -1.4%
test_to_str_llama3 107.2 107.8 +0.6%
test_train_bpe_small 16.2 16.6 +2.0%

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 10, 2026

Rust Benchmark results

Commit: ce0a769cb4b9baea8ee77420637cac5656e7d7cd

Benchmark Baseline (ns/iter) This run (ns/iter) Δ
bpe-gpt2/encode 1815016018 1822885903 0%
bpe-gpt2/encode-batch 883721924 888008576 0%
bpe-gpt2/encode-batch-no-cache 1024733230 1050980682 +2%
bpe-gpt2/encode-no-cache 2345818394 2345331343 0%
llama3/concurrent-4t 76814529 49391017 -35%
llama3/encode 1754898015 1737460249 0%
llama3/encode-batch 867783684 859803718 0%
llama3/encode-char-offsets 1067309310 1058852424 0%
llama3/encode-fast 1672139715 1635931643 -2%
serialization/bpe-from-file-gpt2 47651117 45311913 -4%
serialization/deserialize-llama3 405279321 376483596 -7%
serialization/deserialize-roberta 74238789 71004040 -4%
serialization/from-file-albert 36663177 33642120 -8%
serialization/from-file-llama3 371594895 353979525 -4%
serialization/from-file-roberta 62753817 59109346 -5%
serialization/save-llama3 109097437 95735071 -12%
train/bpe-small 17622182 17347387 -1%

The normalized cache is a derived value from content + normalizer.
Serializing it is wrong because the normalizer may have changed since
the file was saved. The cache is always rebuilt during add_tokens()
using the active normalizer.
@ArthurZucker
Copy link
Copy Markdown
Collaborator Author

/benchmark

Comment thread tokenizers/Cargo.toml Outdated
ArthurZucker and others added 2 commits April 10, 2026 16:10
Consistent with add_tokens — consumes tokens instead of borrowing,
avoiding clones in the internal forwarding call.
Comment thread .github/workflows/CI.yml Outdated
@ArthurZucker ArthurZucker merged commit d863e6e into main Apr 10, 2026
35 checks passed
@ArthurZucker ArthurZucker deleted the fix-byte-norm branch April 10, 2026 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants