Optimizations for significantly faster tokenizer loading #303
DePasqualeOrg wants to merge 33 commits into huggingface:main
Conversation
#304 should be merged before this PR, which depends on it.
yyjson already handles BOM characters in strings correctly.
pcuenca left a comment
Thanks a lot for the comprehensive work, but it's difficult for me to understand all the consequences of the various changes here, and the complexity vs performance tradeoff; it would have been useful to focus on the most impactful change first (skip config extraction of large containers). Having said that, the replacement of BytePair with UInt64 looks reasonable, and the ability to register tokenizers is convenient and we should probably have done it before.
I don't understand where the 50ms of the stringToId fallback dictionary come from, as it appears that we are always building the dictionary and then discarding it (most times).
I'm mostly concerned about backward compatibility and missing edge cases. I'd also like to better understand where we might be sacrificing clarity for negligible performance savings. I'll download your branch and go through it locally.
Are the performance numbers you cite measured on top of #304? That is, is the 1411ms "before" time measured from main, or from the branch associated with #304? What are the incremental gains of the optimizations from this PR, assuming #304 was merged?
I can address the open questions after #304 is merged, perhaps by splitting this PR into a few more focused PRs.
Currently, tokenizer loading is a major performance bottleneck in Swift, typically taking ~1400 ms compared to ~100 ms in Python.
This PR optimizes tokenizer loading for a 3.6x speedup, saving ~1020 ms. #302 and #304 in swift-transformers and #21 in swift-huggingface add further performance gains.
Performance
Tested with the Qwen/Qwen3-0.6B-Base tokenizer (150k vocab, 150k merges) on an M3 MacBook Pro.
A benchmark is included in LoadingBenchmarks.swift, which can be removed before merging or kept to measure the impact of future changes. Run it with swift test --filter LoadingBenchmarks.
Problems with the Current Implementation
Config wrapper overhead: Config.convertToBinaryDistinctKeys() recursively wraps every JSON value. For 150k vocab + 150k merges, this means 300k+ object allocations taking ~1 second, only to be immediately unwrapped.
Expensive merge lookups: [BytePair: Int] uses string hashing for merge rank lookups, which is slow for 150k entries.
Sequential initialization: Expensive dictionary building happens sequentially, leaving CPU cores idle.
Solutions Implemented in this PR
1. Config Bypass (~1000ms saved)
Extract vocab/merges directly from raw JSON before Config conversion, passing them to new fast-path initializers.
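A minimal sketch of the idea (names and paths are illustrative, not the PR's exact API): read the decoded tokenizer.json dictionary directly and pull out vocab and merges before anything gets wrapped in Config.

```swift
import Foundation

// Illustrative sketch: extract vocab/merges from the raw dictionary before
// any Config wrapping happens.
func extractRawVocabAndMerges(from tokenizerJSON: URL) throws -> (vocab: [String: Int], merges: [String]) {
    let data = try Data(contentsOf: tokenizerJSON)
    let root = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    let model = root?["model"] as? [String: Any]

    let vocab = model?["vocab"] as? [String: Int] ?? [:]   // token -> id
    let merges = model?["merges"] as? [String] ?? []       // "left right" pairs

    // These raw collections would then be handed to a fast-path initializer,
    // bypassing the per-value Config allocations entirely.
    return (vocab, merges)
}
```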
2. Integer-Packed Merge Keys (~180ms saved)
Replace [BytePair: Int] (expensive string hashing) with [UInt64: Int] (fast integer hashing):
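A sketch of the packing scheme, assuming token ids fit in 32 bits (true for a 150k vocab); helper and variable names are illustrative:

```swift
// Pack the left and right token ids of a merge pair into one UInt64, so the
// merge-rank table hashes a single integer instead of two strings.
@inline(__always)
func mergeKey(_ left: Int, _ right: Int) -> UInt64 {
    (UInt64(left) << 32) | UInt64(right)
}

var mergeRanks: [UInt64: Int] = [:]
mergeRanks[mergeKey(15, 27)] = 0          // rank 0 for the first merge (ids illustrative)
let rank = mergeRanks[mergeKey(15, 27)]   // O(1) lookup with cheap integer hashing
```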
3. Parallel Dictionary Building (~106ms saved)
Use async let to build dictionaries concurrently:
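A minimal sketch of the pattern (the specific dictionaries built in the PR may differ): each async let binding runs in its own child task, so the two tables are constructed concurrently instead of sequentially.

```swift
func buildLookups(
    vocab: [String: Int],
    merges: [(String, String)]
) async -> (idsToTokens: [Int: String], mergeRanks: [UInt64: Int]) {
    // Reverse vocab and merge ranks are independent, so build them in parallel.
    async let idsToTokens: [Int: String] = Dictionary(
        uniqueKeysWithValues: vocab.map { ($0.value, $0.key) }
    )
    async let mergeRanks: [UInt64: Int] = {
        var ranks: [UInt64: Int] = [:]
        for (rank, pair) in merges.enumerated() {
            guard let left = vocab[pair.0], let right = vocab[pair.1] else { continue }
            ranks[(UInt64(left) << 32) | UInt64(right)] = rank   // integer-packed key from step 2
        }
        return ranks
    }()
    return await (idsToTokens, mergeRanks)
}
```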
4. Conditional stringToId for Unicode Edge Cases
Added an optional stringToId fallback dictionary for tokenizers with Unicode edge cases (e.g., Gemma's BOM-prefixed tokens). Only built when needed; most tokenizers skip this entirely, saving ~50ms.
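A hypothetical sketch of how an optional fallback like this is consulted (the real property names and the condition that triggers building it may differ):

```swift
// Primary lookup plus an optional fallback that is only populated for
// tokenizers whose tokens need special Unicode handling (e.g. BOM prefixes).
struct VocabLookup {
    let tokensToIds: [String: Int]
    let stringToId: [String: Int]?   // nil for most tokenizers, so no build cost

    func id(for token: String) -> Int? {
        tokensToIds[token] ?? stringToId?[token]
    }
}
```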
Backward Compatibility
Standard usage benefits from the faster loading automatically:
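For example, the usual entry point keeps working and simply picks up the faster path (model id is illustrative):

```swift
import Tokenizers

func loadTokenizer() async throws -> Tokenizer {
    // Unchanged call site; loading is just faster after this PR.
    try await AutoTokenizer.from(pretrained: "Qwen/Qwen3-0.6B-Base")
}
```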
Direct use of LanguageModelConfigurationFromHub continues to work unchanged. The default behavior preserves tokenizerData.model.vocab and tokenizerData.model.merges for backward compatibility.
For callers who want the performance optimization, opt in with stripVocabForPerformance: true and use the new properties:
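A sketch of the opt-in; the exact placement of stripVocabForPerformance and the names of the new raw vocab/merges properties are assumptions based on this description, not verified against the final API:

```swift
import Hub  // assumption: the module that exposes LanguageModelConfigurationFromHub

func loadWithStrippedConfig() async throws {
    // Hypothetical call site: parameter placement is illustrative.
    let config = LanguageModelConfigurationFromHub(
        modelName: "Qwen/Qwen3-0.6B-Base",
        stripVocabForPerformance: true   // opt in: skip Config-wrapped vocab/merges
    )
    let tokenizerData = try await config.tokenizerData
    // Read vocab/merges from the new raw properties here instead of
    // tokenizerData.model.vocab / tokenizerData.model.merges.
    _ = tokenizerData
}
```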
Custom Tokenizer Registration
Added AutoTokenizer.register(_:for:) for registering custom tokenizer classes:
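A sketch of how registration might look at the call site; the custom class, its required superclass/protocol, and the argument order of register(_:for:) are assumptions based on this description:

```swift
import Tokenizers

// Hypothetical custom tokenizer; the required conformance is an assumption.
final class MyCustomTokenizer: PreTrainedTokenizer {}

// Register the class so AutoTokenizer can resolve it when
// tokenizer_config.json names "MyCustomTokenizer" as the tokenizer class.
AutoTokenizer.register(MyCustomTokenizer.self, for: "MyCustomTokenizer")

// Loading then goes through the normal entry point:
// let tokenizer = try await AutoTokenizer.from(pretrained: "my-org/my-model")
```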
This mirrors Python transformers' AutoTokenizer.register(), which populates REGISTERED_TOKENIZER_CLASSES for lookup by class name. This makes it easy for downstream projects like mlx-swift-lm to use the fast path via AutoTokenizer.from() while still supporting custom tokenizer classes when needed.
Testing
All existing tests pass.
Alignment with Python
These optimizations align with patterns in the Python tokenizers library:
- Pair type with tuple hashing
- convert_to_native_format()
A new major version could make breaking changes for better ergonomics:
- Tokenizer.load(from:) as primary entry point
- TokenStorage type
This would reduce BPETokenizer from 4 initializers to 1 factory method while maintaining performance.