🚨 Refactor a bit add_tokens logic: fix bytelevel decode of added tokens + less memory deserialization 🚨 by ArthurZucker · Pull Request #1995 · huggingface/tokenizers

ArthurZucker · 2026-03-27T21:24:35Z

Summary

🚨 When adding a token, its content should be normalized if normalized=True 🚨
This is quite a change, but it should mostly affect the internal representation.
Just realized that if content is stored normalized, we'll have an issue when we serialize/deserialize.
Might as well save a new internal _normalized field that defaults to content?
Simplify the algorithm: consume the added tokens, especially when deserializing, so nothing is cloned. This should give a mild performance boost now that we have daachorse.
Add a Python test for ByteLevel decode with a normalizer.
Before this PR, you had to manually normalize the tokens you were adding.
Previously every add_tokens call renormalized the entire added vocab to build the regex; now it only normalizes what's new. This can be wrong if the normalizer changes; we could guard against this by updating the tokenizer.normalizer setter to refresh added tokens.
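
To make the last point concrete, here is a minimal sketch of "normalize only what is new, refresh everything when the normalizer changes". It is not the code from this PR: `Vocab`, `Token`, `match_form_to_id`, `normalize_calls`, and `set_normalizer` are hypothetical names, the normalizer is reduced to a plain function pointer, and duplicate handling is left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a normalizer: any function from &str to String.
type Normalizer = fn(&str) -> String;

/// Hypothetical, simplified added token (not the real `AddedToken`).
#[derive(Debug, Clone, PartialEq)]
struct Token {
    content: String,
    normalized: bool,
}

/// Hypothetical, simplified added vocabulary.
struct Vocab {
    normalizer: Normalizer,
    /// Maps the form we match against (normalized if requested) to the token id.
    match_form_to_id: HashMap<String, u32>,
    tokens: Vec<Token>,
    /// Counts normalizer invocations so the saving is observable.
    normalize_calls: usize,
}

impl Vocab {
    fn new(normalizer: Normalizer) -> Self {
        Self {
            normalizer,
            match_form_to_id: HashMap::new(),
            tokens: Vec::new(),
            normalize_calls: 0,
        }
    }

    /// Computes the string that is matched in input text for this token.
    fn match_form(&mut self, token: &Token) -> String {
        if token.normalized {
            self.normalize_calls += 1;
            (self.normalizer)(&token.content)
        } else {
            token.content.clone()
        }
    }

    /// Consumes the new tokens and normalizes only those, not the existing vocab.
    fn add_tokens(&mut self, new_tokens: impl IntoIterator<Item = Token>) {
        for token in new_tokens {
            let id = self.tokens.len() as u32;
            let form = self.match_form(&token);
            self.match_form_to_id.insert(form, id);
            self.tokens.push(token);
        }
    }

    /// Guard: a new normalizer invalidates every cached form, so rebuild them all.
    fn set_normalizer(&mut self, normalizer: Normalizer) {
        self.normalizer = normalizer;
        self.match_form_to_id.clear();
        let tokens = std::mem::take(&mut self.tokens);
        for (id, token) in tokens.iter().enumerate() {
            let form = self.match_form(token);
            self.match_form_to_id.insert(form, id as u32);
        }
        self.tokens = tokens;
    }
}

/// Toy normalizers used only for illustration.
fn lowercase(s: &str) -> String {
    s.to_lowercase()
}

fn identity(s: &str) -> String {
    s.to_string()
}
```

In this sketch the cost of `add_tokens` depends on the size of the new batch, while `set_normalizer` is the one place that pays for a full refresh.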

Benchmark results (`added_vocab_deserialize`, 100k tokens)

Ran against up-to-date main (includes daachorse, #1999) to confirm zero overhead:

| Variant | `main` | `fix-byte-norm` |
| --- | --- | --- |
| special tokens (no normalizer) | 232 ms | 222 ms |
| non-special tokens (no normalizer) | 258 ms | 251 ms |
| special tokens (nfkc) | 226 ms | 221 ms |
| non-special tokens (nfkc) | 263 ms | 260 ms |

No measurable regression introduced by this PR.

HuggingFaceDocBuilderDev · 2026-03-27T21:27:37Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

The daachorse library swap is now isolated in PR #1999. This restores aho-corasick while keeping all the algorithmic improvements:
- refresh_added_tokens uses added_tokens_map_r directly (no token_to_id loop)
- MatchingSet is Option<(AhoCorasick, Vec<u32>)> for empty-trie fast-path
- AddedVocabulary no longer maintains redundant added_tokens/special_tokens Vecs
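
The empty-trie fast path can be sketched as follows. This is an illustration only: `NaiveMatcher` is a hypothetical std-only stand-in for the real `AhoCorasick` automaton, and `build_matching_set` and `find_added_ids` are made-up helper names. Only the `Option<(matcher, ids)>` shape is taken from the commit message.

```rust
/// Hypothetical stand-in for an Aho-Corasick automaton: a naive scanner that
/// tries each pattern at the current position, in insertion order.
struct NaiveMatcher {
    patterns: Vec<String>,
}

impl NaiveMatcher {
    fn new(patterns: Vec<String>) -> Self {
        Self { patterns }
    }

    /// Returns (start, end, pattern_index) for non-overlapping matches, left to right.
    fn find_all(&self, text: &str) -> Vec<(usize, usize, usize)> {
        let mut out = Vec::new();
        let mut pos = 0;
        while pos < text.len() {
            let rest = &text[pos..];
            match self.patterns.iter().position(|p| rest.starts_with(p.as_str())) {
                Some(idx) => {
                    let end = pos + self.patterns[idx].len();
                    out.push((pos, end, idx));
                    pos = end;
                }
                // No pattern starts here: advance by one whole character.
                None => pos += rest.chars().next().map_or(1, |c| c.len_utf8()),
            }
        }
        out
    }
}

/// Mirrors the shape `Option<(AhoCorasick, Vec<u32>)>` from the commit message.
type MatchingSet = Option<(NaiveMatcher, Vec<u32>)>;

/// Builds `None` when there is nothing to match, so no automaton is constructed.
fn build_matching_set(tokens: &[(String, u32)]) -> MatchingSet {
    let (patterns, ids): (Vec<String>, Vec<u32>) = tokens
        .iter()
        .filter(|t| !t.0.is_empty()) // empty patterns would never advance the scan
        .cloned()
        .unzip();
    if patterns.is_empty() {
        None
    } else {
        Some((NaiveMatcher::new(patterns), ids))
    }
}

/// Looks up added-token ids in `text`; the `None` arm is the fast path.
fn find_added_ids(set: &MatchingSet, text: &str) -> Vec<u32> {
    match set {
        None => Vec::new(),
        Some((matcher, ids)) => matcher
            .find_all(text)
            .into_iter()
            .map(|(_, _, idx)| ids[idx])
            .collect(),
    }
}
```

With the `Option` wrapper, a tokenizer that has no added tokens skips both building and scanning a matcher.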

McPatate

great stuff!

McPatate · 2026-04-08T14:03:33Z

Suggested change

tokens: impl IntoIterator<Item = AddedToken>,

…ialEq
- Replace .unwrap()/.expect() with ? in add_tokens, refresh_normalized_tokens, refresh_added_tokens, and with_normalizer (return Result instead of panicking)
- Remove seed_normalized_cache and double iteration during deserialization; let add_tokens compute normalization directly
- Use PartialEq for duplicate token comparison instead of manual field checks (also fixes missing single_word comparison)
- Update all callers in tests, benchmarks, examples, and Python/Node bindings
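
Two of these points, propagating errors with `?` and comparing tokens through a derived `PartialEq`, can be illustrated with a small sketch. Everything here is hypothetical (the `Token` fields, `same_manual`, `NormalizeError`, the toy `normalize`, and this free-standing `add_tokens`); it only shows why a derived comparison cannot forget a field while a hand-written one can.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical, simplified token; the real `AddedToken` may differ.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    content: String,
    single_word: bool,
    lstrip: bool,
    rstrip: bool,
    normalized: bool,
    special: bool,
}

/// The kind of hand-written check that is easy to get wrong:
/// `single_word` is deliberately forgotten here.
fn same_manual(a: &Token, b: &Token) -> bool {
    a.content == b.content
        && a.lstrip == b.lstrip
        && a.rstrip == b.rstrip
        && a.normalized == b.normalized
        && a.special == b.special
}

/// Hypothetical error type for the toy normalizer below.
#[derive(Debug, PartialEq)]
struct NormalizeError(String);

impl fmt::Display for NormalizeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "cannot normalize {:?}", self.0)
    }
}

impl std::error::Error for NormalizeError {}

/// Toy fallible normalizer: lowercases, and rejects NUL characters.
fn normalize(s: &str) -> Result<String, NormalizeError> {
    if s.contains('\0') {
        return Err(NormalizeError(s.to_string()));
    }
    Ok(s.to_lowercase())
}

/// Returns how many tokens were inserted or updated.
fn add_tokens(
    vocab: &mut HashMap<String, Token>,
    tokens: impl IntoIterator<Item = Token>,
) -> Result<usize, NormalizeError> {
    let mut changed = 0;
    for token in tokens {
        // `?` hands the error to the caller instead of panicking.
        let key = if token.normalized {
            normalize(&token.content)?
        } else {
            token.content.clone()
        };
        // Derived `PartialEq` compares every field, including `single_word`.
        if vocab.get(&key) == Some(&token) {
            continue; // exact duplicate, nothing to do
        }
        vocab.insert(key, token);
        changed += 1;
    }
    Ok(changed)
}
```

In this sketch a token that differs only in `single_word` is treated as an update, whereas `same_manual` would have silently skipped it.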

ArthurZucker · 2026-04-10T10:35:30Z

/benchmark

github-actions · 2026-04-10T10:35:48Z

Python Benchmark results

Commit: ce0a769cb4b9baea8ee77420637cac5656e7d7cd

| Benchmark | Baseline (ms) | This run (ms) | Δ |
| --- | --- | --- | --- |
| test_async_encode_batch | 1305.2 | 1296.1 | -0.7% |
| test_async_encode_batch_fast | 1054.6 | 1031.7 | -2.2% |
| test_decode_batch | 2.4 | 2.4 | +2.3% |
| test_encode | 2545.9 | 2694.3 | +5.8% |
| test_encode_batch | 1301.0 | 1303.5 | +0.2% |
| test_encode_batch_multithreaded | 1289.6 | 1276.2 | -1.0% |
| test_encode_fast | 1043.3 | 1033.1 | -1.0% |
| test_from_file_albert | 45.4 | 47.1 | +3.8% |
| test_from_file_llama3 | 408.7 | 410.3 | +0.4% |
| test_from_file_roberta | 76.1 | 76.2 | +0.2% |
| test_from_str_llama3 | 389.0 | 383.4 | -1.4% |
| test_to_str_llama3 | 107.2 | 107.8 | +0.6% |
| test_train_bpe_small | 16.2 | 16.6 | +2.0% |

github-actions · 2026-04-10T10:41:56Z

Rust Benchmark results

Commit: ce0a769cb4b9baea8ee77420637cac5656e7d7cd

| Benchmark | Baseline (ns/iter) | This run (ns/iter) | Δ |
| --- | --- | --- | --- |
| bpe-gpt2/encode | 1815016018 | 1822885903 | 0% |
| bpe-gpt2/encode-batch | 883721924 | 888008576 | 0% |
| bpe-gpt2/encode-batch-no-cache | 1024733230 | 1050980682 | +2% |
| bpe-gpt2/encode-no-cache | 2345818394 | 2345331343 | 0% |
| llama3/concurrent-4t | 76814529 | 49391017 | -35% |
| llama3/encode | 1754898015 | 1737460249 | 0% |
| llama3/encode-batch | 867783684 | 859803718 | 0% |
| llama3/encode-char-offsets | 1067309310 | 1058852424 | 0% |
| llama3/encode-fast | 1672139715 | 1635931643 | -2% |
| serialization/bpe-from-file-gpt2 | 47651117 | 45311913 | -4% |
| serialization/deserialize-llama3 | 405279321 | 376483596 | -7% |
| serialization/deserialize-roberta | 74238789 | 71004040 | -4% |
| serialization/from-file-albert | 36663177 | 33642120 | -8% |
| serialization/from-file-llama3 | 371594895 | 353979525 | -4% |
| serialization/from-file-roberta | 62753817 | 59109346 | -5% |
| serialization/save-llama3 | 109097437 | 95735071 | -12% |
| train/bpe-small | 17622182 | 17347387 | -1% |

The normalized cache is a derived value from content + normalizer. Serializing it is wrong because the normalizer may have changed since the file was saved. The cache is always rebuilt during add_tokens() using the active normalizer.
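
A minimal sketch of that rule, assuming a hand-rolled "saved form" rather than the library's real serialization: `Entry`, `Vocab`, `save`, and `load` are hypothetical names, and the normalizer is again a plain function pointer. The point is only that the cache never reaches the saved data and is recomputed from `content` with whatever normalizer is active at load time.

```rust
/// Hypothetical stand-in for a normalizer.
type Normalizer = fn(&str) -> String;

/// Hypothetical entry: `content` is the source of truth, `normalized_cache` is derived.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    content: String,
    normalized_cache: String,
}

/// Hypothetical, simplified added vocabulary.
struct Vocab {
    normalizer: Normalizer,
    entries: Vec<Entry>,
}

impl Vocab {
    fn new(normalizer: Normalizer) -> Self {
        Self { normalizer, entries: Vec::new() }
    }

    /// The only place the cache is computed, always with the active normalizer.
    fn add_tokens(&mut self, contents: impl IntoIterator<Item = String>) {
        for content in contents {
            let normalized_cache = (self.normalizer)(&content);
            self.entries.push(Entry { content, normalized_cache });
        }
    }

    /// "Serialization" for this sketch: only the original content is written out.
    fn save(&self) -> Vec<String> {
        self.entries.iter().map(|e| e.content.clone()).collect()
    }

    /// "Deserialization": consume the saved contents and rebuild the cache.
    fn load(saved: Vec<String>, normalizer: Normalizer) -> Self {
        let mut vocab = Self::new(normalizer);
        vocab.add_tokens(saved);
        vocab
    }
}

/// Toy normalizers used only for illustration.
fn lowercase(s: &str) -> String {
    s.to_lowercase()
}

fn uppercase(s: &str) -> String {
    s.to_uppercase()
}
```

Had the cache been saved alongside `content`, loading with a different normalizer would have kept stale values; rebuilding through `add_tokens` makes that impossible in this sketch.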

ArthurZucker · 2026-04-10T13:45:24Z

/benchmark

we don't need the model....

540b5be

ArthurZucker requested a review from McPatate March 27, 2026 21:28

ArthurZucker added 4 commits March 27, 2026 22:29

fmt

74b69b3

nit

ae755ad

better version?

5e20a0e

test + the only viable fix

9c374ce

ArthurZucker mentioned this pull request Mar 27, 2026

[Bug] Fast Tokenizer (ByteLevel BPE) incorrectly decodes added_tokens containing specific Unicode characters (e.g., 'č' becomes '\r') #1996

Closed

8 tasks

ArthurZucker added 3 commits March 27, 2026 23:45

up

6f6c305

fmt

0ed3997

nit

f432a88

ArthurZucker changed the title ~~Draft update of added token refreshing which is a bottleneck~~ Fix bytelevel decode of added tokens Mar 27, 2026

ArthurZucker commented Mar 27, 2026

Comment thread bindings/python/tests/bindings/test_tokenizer.py Outdated

fix

ddb06b9

ArthurZucker changed the title ~~Fix bytelevel decode of added tokens~~ Fix bytelevel decode of added tokens + 27x faster deserialization Mar 27, 2026

ArthurZucker and others added 5 commits March 28, 2026 00:18

fmt + clippy

1188bdc

Merge branch 'main' into fix-byte-norm

cf49318

update the added vocab bench to reflect real world usecases

9f65ea6

big update, squeeze even more perfs

8c33ac3

profile exampl.e

f52989d

ArthurZucker changed the title ~~Fix bytelevel decode of added tokens + 27x faster deserialization~~ Refactor a bit add_tokens logic: fix bytelevel decode of added tokens + faster deserialization Mar 30, 2026

ArthurZucker added 4 commits March 30, 2026 12:02

fmt

2269694

bench: reduce normalizers to 2 (none, nfkc) for faster runs

94d962b

bench: fix sample_size to 10 (criterion minimum)

e509a74

bench: reduce to 100k tokens only for practical CI runtime

5603cf1

ArthurZucker mentioned this pull request Mar 30, 2026

perf: replace aho-corasick with daachorse for added vocabulary matching #1999

Merged

ArthurZucker added 4 commits March 30, 2026 13:27

mut not required

6f8cc8e

just fmt

e000f7d

nits

05eeaec

ArthurZucker added 2 commits April 8, 2026 13:46

fix serialzie

363b59a

fmt + fix test

d83655c

McPatate reviewed Apr 8, 2026

ArthurZucker and others added 12 commits April 9, 2026 09:27

refactor: use ToPyResult for cleaner error conversion in Python bindings

55b6fbd

up?

d1ae6e9

patch

d4e7d36

order

43b6976

ci: pin macOS Python to 3.13 (3.14 breaks abi3 linking)

ae50be3

ci: add -undefined dynamic_lookup for macOS abi3 cross-compilation li…

29bb7f1

…nking

fix: rustfmt

4ee5e8a

fix: remove missing profile_added_vocab_deserialize example from Carg…

c05657b

…o.toml

fix: rustfmt node bindings

a4a3d81

Merge branch 'main' into fix-byte-norm

b83f702

fix: ci_benchmark add_tokens signature compat

ef9119b

huggingface deleted a comment from github-actions Bot Apr 10, 2026

ArthurZucker added 3 commits April 10, 2026 12:44

fix: clippy unused Result warning

4ceea3f

fix: support TOKENIZERS_DATA_DIR env var in Python benchmarks

849569d

ArthurZucker commented Apr 10, 2026

Comment thread tokenizers/Cargo.toml Outdated

ArthurZucker and others added 2 commits April 10, 2026 16:10

Update tokenizers/Cargo.toml

f030500

refactor: add_special_tokens takes impl IntoIterator<Item = AddedToken>

651a339

Consistent with add_tokens — consumes tokens instead of borrowing, avoiding clones in the internal forwarding call.
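
The difference between the borrowing and the consuming signature can be sketched like this. The types and method bodies are hypothetical (in particular, marking tokens as special inside `add_special_tokens` is an assumption made for the example, not a claim about the library); only the `impl IntoIterator<Item = ...>` parameter shape comes from the commit.

```rust
/// Hypothetical, simplified token.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    content: String,
    special: bool,
}

/// Hypothetical, simplified vocabulary.
struct Vocab {
    tokens: Vec<Token>,
}

/// Marks a token as special; takes and returns it by value, so nothing is copied.
fn mark_special(mut token: Token) -> Token {
    token.special = true;
    token
}

impl Vocab {
    fn new() -> Self {
        Self { tokens: Vec::new() }
    }

    /// Consumes its input and returns how many tokens were added.
    fn add_tokens(&mut self, tokens: impl IntoIterator<Item = Token>) -> usize {
        let before = self.tokens.len();
        self.tokens.extend(tokens);
        self.tokens.len() - before
    }

    /// Borrowing variant: forwarding to `add_tokens` forces a clone of every token.
    fn add_special_tokens_borrowed(&mut self, tokens: &[Token]) -> usize {
        self.add_tokens(tokens.iter().cloned().map(mark_special))
    }

    /// Consuming variant: tokens are moved straight through, no clone needed.
    fn add_special_tokens(&mut self, tokens: impl IntoIterator<Item = Token>) -> usize {
        self.add_tokens(tokens.into_iter().map(mark_special))
    }
}
```

Callers that still need their tokens afterwards can pass `tokens.iter().cloned()` or `tokens.clone()` explicitly, which makes the copy visible at the call site instead of hiding it inside the method.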

ArthurZucker commented Apr 10, 2026

View reviewed changes

Comment thread .github/workflows/CI.yml Outdated

Apply suggestion from @ArthurZucker

c3b3dc3

ArthurZucker merged commit d863e6e into main Apr 10, 2026
35 checks passed

ArthurZucker deleted the fix-byte-norm branch April 10, 2026 14:52

3 participants
