
Optimizations for significantly faster tokenizer loading #303

Open

DePasqualeOrg wants to merge 33 commits into huggingface:main from DePasqualeOrg:tokenizer-optimizations

Conversation

@DePasqualeOrg DePasqualeOrg commented Dec 27, 2025

Currently, tokenizer loading is a major performance bottleneck in Swift, typically taking ~1400 ms compared to ~100 ms in Python.

This PR optimizes tokenizer loading for a 3.6x speedup, saving ~1020 ms. #302 and #304 in swift-transformers and #21 in swift-huggingface add further performance gains.

Performance

| Metric    | Before  | This PR | Speedup               |
|-----------|---------|---------|-----------------------|
| Load time | 1411 ms | 391 ms  | 3.6x (~1020 ms saved) |

Tested with the Qwen/Qwen3-0.6B-Base tokenizer (150k vocab, 150k merges) on an M3 MacBook Pro.

A benchmark is included in LoadingBenchmarks.swift, which can be removed before merging or kept to measure the impact of future changes. Run it with swift test --filter LoadingBenchmarks.

Problems with the Current Implementation

  1. Config wrapper overhead: Config.convertToBinaryDistinctKeys() recursively wraps every JSON value. For 150k vocab + 150k merges, this means 300k+ object allocations taking ~1 second — only to be immediately unwrapped.

  2. Expensive merge lookups: [BytePair: Int] uses string hashing for merge rank lookups, which is slow for 150k entries.

  3. Sequential initialization: Expensive dictionary building happens sequentially, leaving CPU cores idle.

Solutions Implemented in This PR

1. Config Bypass (~1000 ms saved)

Extract vocab/merges directly from raw JSON before Config conversion, passing them to new fast-path initializers.
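
A rough sketch of the idea (tokenizerJSONURL and the fast-path initializer mentioned in the comments are illustrative placeholders, not the PR's exact API):

import Foundation

// Sketch: parse tokenizer.json once and pull the two large containers out of
// the raw JSON before anything is wrapped into Config.
let data = try Data(contentsOf: tokenizerJSONURL)
let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
let model = json?["model"] as? [String: Any]
let vocab = model?["vocab"] as? [String: Any]  // token string -> id
let merges = model?["merges"] as? [Any]        // merge rules
// vocab/merges go straight to a fast-path initializer; only the remaining,
// small parts of the file go through Config conversion.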

2. Integer-Packed Merge Keys (~180 ms saved)

Replace [BytePair: Int] (expensive string hashing) with [UInt64: Int] (fast integer hashing):

// Pack two token IDs into one UInt64
let key = UInt64(tokenIdA) << 32 | UInt64(tokenIdB)
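
Building and querying the rank table then looks roughly like this (buildBpeRanks here is a stand-in showing the shape, not the PR's exact helper):

// Build the rank table once at load time; the i-th merge rule gets rank i.
func buildBpeRanks(_ mergePairs: [(UInt32, UInt32)]) -> [UInt64: Int] {
    var ranks = [UInt64: Int](minimumCapacity: mergePairs.count)
    for (rank, pair) in mergePairs.enumerated() {
        ranks[UInt64(pair.0) << 32 | UInt64(pair.1)] = rank
    }
    return ranks
}

// During merging, one integer hash replaces two string hashes:
// bpeRanks[UInt64(tokenIdA) << 32 | UInt64(tokenIdB)] is nil when the pair is not a known merge.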

3. Parallel Dictionary Building (~106 ms saved)

Use async let to build dictionaries concurrently:

// Phase 1: Independent tasks
async let tokensToIdsTask = buildTokensToIds(...)
async let mergesTask = parseMerges(...)

// Phase 2: Dependent tasks (after Phase 1 completes)
async let bpeRanksTask = buildBpeRanks(...)
async let idsToTokensTask = buildIdsToTokens(...)
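
A minimal, self-contained sketch of the same two-phase pattern (the stand-in builders here are trivial; the PR's real helpers do the expensive work):

func loadTables(rawVocab: [String: Int]) async -> (tokensToIds: [String: Int], idsToTokens: [Int: String]) {
    // Phase 1: independent child tasks start immediately.
    async let tokensToIds = rawVocab  // stand-in for buildTokensToIds(...)
    // Phase 2: work that needs a phase-1 result awaits it first, then runs
    // as its own child task while unrelated phase-1 work may still be in flight.
    let ids = await tokensToIds
    async let idsToTokens = Dictionary(uniqueKeysWithValues: ids.map { ($1, $0) })
    return (ids, await idsToTokens)
}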

4. Conditional stringToId for Unicode Edge Cases

Added an optional stringToId fallback dictionary for tokenizers with Unicode edge cases (e.g., Gemma's BOM-prefixed tokens). It is only built when needed; most tokenizers skip this step entirely, saving ~50 ms.
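
One plausible shape of the conditional build (assuming a vocab: [String: Int] table; the actual condition in the PR may differ):

// Only pay for the fallback map when the vocab actually contains tokens that
// need exact, non-normalized matching, such as BOM-prefixed tokens.
let needsFallback = vocab.keys.contains { $0.hasPrefix("\u{FEFF}") }
let stringToId: [String: Int]? = needsFallback ? vocab : nil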

Backward Compatibility

Standard usage benefits from the faster loading automatically:

let tokenizer = try await AutoTokenizer.from(pretrained: "model-name")
let tokenizer = try await AutoTokenizer.from(modelFolder: url)

Direct use of LanguageModelConfigurationFromHub continues to work unchanged. The default behavior preserves tokenizerData.model.vocab and tokenizerData.model.merges for backward compatibility.

For callers who want the performance optimization, opt in with stripVocabForPerformance: true and use the new properties:

let config = LanguageModelConfigurationFromHub(
    modelName: "model-name",
    stripVocabForPerformance: true
)
let vocab = try await config.tokenizerVocab   // NSDictionary for BPE
let merges = try await config.tokenizerMerges // [Any] for BPE

Custom Tokenizer Registration

Added AutoTokenizer.register(_:for:) for registering custom tokenizer classes:

AutoTokenizer.register(MyTokenizer.self, for: "MyCustomTokenizer")

This mirrors Python transformers' AutoTokenizer.register(), which populates REGISTERED_TOKENIZER_CLASSES for lookup by class name.

This makes it easy for downstream projects like mlx-swift-lm to use the fast path via AutoTokenizer.from() while still supporting custom tokenizer classes when needed.
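
For example (MyTokenizer, its conformance, and the model name are hypothetical; the registered name is presumably matched against the tokenizer class named in the model's tokenizer config):

// Register once at startup; MyTokenizer is a hypothetical custom class.
AutoTokenizer.register(MyTokenizer.self, for: "MyCustomTokenizer")

// A model whose tokenizer config names "MyCustomTokenizer" now resolves to
// MyTokenizer while still benefiting from the fast loading path.
let tokenizer = try await AutoTokenizer.from(pretrained: "my-org/my-model")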

Testing

All existing tests pass.

Alignment with Python

These optimizations align with patterns in the Python tokenizers library:

  • Packed merge keys ≈ tokenizers Pair type with tuple hashing
  • Parallel building ≈ tokenizers Rayon-based parallelism
  • Raw JSON extraction ≈ tokenizers convert_to_native_format()
  • NSString for Unicode ≈ tokenizers byte-level preservation

Future Work

A new major version could make breaking changes for better ergonomics:

  • Async-first API: Tokenizer.load(from:) as primary entry point
  • Factory methods instead of protocol-mandated initializers
  • Hide Unicode complexity behind internal TokenStorage type

This would reduce BPETokenizer from 4 initializers to 1 factory method while maintaining performance.

@DePasqualeOrg (Contributor, Author) commented:

#304 should be merged before this PR, since this PR depends on it.

@pcuenca (Member) left a comment:

Thanks a lot for the comprehensive work, but it's difficult for me to understand all the consequences of the various changes here, and the complexity vs performance tradeoff; it would have been useful to focus on the most impactful change first (skip config extraction of large containers). Having said that, the replacement of BytePair with UInt64 looks reasonable, and the ability to register tokenizers is convenient and we should probably have done it before.

I don't understand where the 50 ms saved by the stringToId fallback dictionary come from, since it appears that we always build the dictionary and then discard it most of the time.

I'm mostly concerned about backward compatibility and missing edge cases. I'd also like to better understand where we might be sacrificing clarity for negligible performance savings. I'll download your branch and go through it locally.

Are the performance numbers you cite measured on top of #304? That is, is the 1411 ms "before" time measured from main, or from the branch associated with #304? What are the incremental gains of the optimizations from this PR, assuming #304 was merged?

@DePasqualeOrg (Contributor, Author) commented:

I can address the open questions after #304 is merged, perhaps by splitting this PR into a few more focused PRs.

