Bindings: acquire model read-lock once per call instead of per pre-token by sebpop · Pull Request #2072 · huggingface/tokenizers

sebpop · 2026-05-26T17:33:24Z

The Python and Node bindings both wrap the inner model in Arc<RwLock<ModelWrapper>>. Their Model::tokenize() impls each do self.model.read().unwrap().tokenize(seq), so a single read() lock acquisition per call is unavoidable -- but TokenizerImpl::do_tokenize calls tokenize() once per pre-token, and a typical ~6 KB document contains ~1 500 pre-tokens. Each RwLock::read acquisition emits a ldadd4_acq_rel on aarch64 (and a paired ldadd4_rel on the matching RwLockReadGuard drop), so a single document costs a few thousand LSE atomic operations that produce no useful work.

Add a default Model::tokenize_in_pretokenized trait method whose default body is the existing per-pre-token loop. Override it in PyModel (Python binding) and Model (Node binding) so that both bindings acquire the read lock once and tokenize every pre-token under the same guard. TokenizerImpl::do_tokenize calls the new method instead of dispatching per pre-token, which makes the optimisation transparent to all callers without changing what Model::tokenize does for any in-tree or out-of-tree implementor.

The default implementation preserves the old behaviour byte-for-byte for any Model that is not behind a RwLock, so adding the method to the public trait is non-breaking: external implementors do not need to override it.

cargo test --lib --features http
 201 passed, 0 failed.

Perf evidence on Vera (88 physical / 176 logical) with wheels built -Ctarget-feature=+lse,+rcpc (so LSE atomics are inlined and do not aggregate behind __aarch64_* outline helpers). A Python process calls tokenizer.encode_batch(docs, false) in a loop for 15 s, where docs is data/big.txt (6.5 MB) split into 999 ~6.5 KB chunks. perf record -g --call-graph fp -F 4999, RAYON_NUM_THREADS=88:

  symbol                                            before    after
  <PyModel as Model>::tokenize                      89.16%    -
  <PyModel as Model>::tokenize_in_pretokenized      -          0.02%
  <ModelWrapper as Model>::tokenize                  0.00%     0.02%
  <BPE as Model>::tokenize                           0.02%     0.13%

Before, 89 % of CPU cycles are inside PyModel::tokenize, which is the wrapper that takes an Arc<RwLock>::read per pre-token, calls the inner BPE, then drops the guard (one RwLock read-acquire + release pair, i.e. two LSE atomics per pre-token, ~3 000 atomic operations per document). After, that wrapper is replaced by a single call to tokenize_in_pretokenized which takes the lock once for the whole pre-token sequence (0.02 %); the actual BPE work moves to <ModelWrapper as Model>::tokenize / <BPE as Model>::tokenize, where it is no longer wrapped in lock acquisitions.

Throughput on the same Python script, same 15 s wall-clock window:

  threads  metric                            before    after     change
  -------  ------                            ------    ------    ------
  1T       encode_batch calls completed         10        10     flat
           effective throughput              4.3 MiB/s 4.3 MiB/s  -- (BPE-merge bound)
  88T      encode_batch calls completed         61       146     +139 %
           time per call                     246 ms    103 ms    -58 %
           effective throughput             26.4 MiB/s 63.2 MiB/s +139 %

At 1T there is no win because tokenisation time is dominated by BPE merging itself; the LSE acquire/release pair is fast in the uncontended single-thread case. At 88T, removing ~3 000 unnecessary LSE atomic operations per document removes about 58 % of the wall-clock cost of encode_batch on this aarch64 host, which matches the observation that PyModel::tokenize dominated the pre-patch profile.

The Rust-only bpe_benchmark (which builds BPE directly via BpeBuilder and never goes through PyModel) is unaffected by this change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T, both within run-to-run noise.

The Python and Node bindings both wrap the inner model in `Arc<RwLock<ModelWrapper>>`. Their `Model::tokenize()` impls each do `self.model.read().unwrap().tokenize(seq)`, so a single `read()` lock acquisition per call is unavoidable -- but `TokenizerImpl::do_tokenize` calls `tokenize()` once per pre-token, and a typical ~6 KB document contains ~1 500 pre-tokens. Each `RwLock::read` acquisition emits a `ldadd4_acq_rel` on aarch64 (and a paired `ldadd4_rel` on the matching `RwLockReadGuard` drop), so a single document costs a few thousand LSE atomic operations that produce no useful work. Add a default `Model::tokenize_in_pretokenized` trait method whose default body is the existing per-pre-token loop. Override it in `PyModel` (Python binding) and `Model` (Node binding) so that both bindings acquire the read lock once and tokenize every pre-token under the same guard. `TokenizerImpl::do_tokenize` calls the new method instead of dispatching per pre-token, which makes the optimisation transparent to all callers without changing what `Model::tokenize` does for any in-tree or out-of-tree implementor. The default implementation preserves the old behaviour byte-for-byte for any `Model` that is not behind a `RwLock`, so adding the method to the public trait is non-breaking: external implementors do not need to override it. cargo test --lib --features http: 201 passed, 0 failed. Perf evidence on Vera (89 physical / 178 logical) with wheels built `-Ctarget-feature=+lse,+rcpc` (so LSE atomics are inlined and do not aggregate behind `__aarch64_*` outline helpers). A Python process calls `tokenizer.encode_batch(docs, false)` in a loop for 15 s, where `docs` is `data/big.txt` (6.5 MB) split into 999 ~6.5 KB chunks. `perf record -g --call-graph fp -F 4999`, RAYON_NUM_THREADS=88: symbol before after <PyModel as Model>::tokenize 89.16% - <PyModel as Model>::tokenize_in_pretokenized - 0.02% <ModelWrapper as Model>::tokenize 0.00% 0.02% <BPE as Model>::tokenize 0.02% 0.13% Before, 89 % of CPU cycles are inside `PyModel::tokenize`, which is the wrapper that takes an `Arc<RwLock>::read` per pre-token, calls the inner BPE, then drops the guard (one `RwLock` read-acquire + release pair, i.e. two LSE atomics per pre-token, ~3 000 atomic operations per document). After, that wrapper is replaced by a single call to `tokenize_in_pretokenized` which takes the lock once for the whole pre-token sequence (0.02 %); the actual BPE work moves to `<ModelWrapper as Model>::tokenize` / `<BPE as Model>::tokenize`, where it is no longer wrapped in lock acquisitions. Throughput on the same Python script, same 15 s wall-clock window: threads metric before after change ------- ------ ------ ------ ------ 1T encode_batch calls completed 10 10 flat effective throughput 4.3 MiB/s 4.3 MiB/s -- (BPE-merge bound) 88T encode_batch calls completed 61 146 +139 % time per call 246 ms 103 ms -58 % effective throughput 26.4 MiB/s 63.2 MiB/s +139 % At 1T there is no win because tokenisation time is dominated by BPE merging itself; the LSE acquire/release pair is fast in the uncontended single-thread case. At 88T, removing ~3 000 unnecessary LSE atomic operations per document removes about 58 % of the wall-clock cost of `encode_batch` on this aarch64 host, which matches the observation that `PyModel::tokenize` dominated the pre-patch profile. The Rust-only `bpe_benchmark` (which builds `BPE` directly via `BpeBuilder` and never goes through `PyModel`) is unaffected by this change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T, both within run-to-run noise.

HuggingFaceDocBuilderDev · 2026-05-28T08:21:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

McPatate

Not sure we want to replicate the comments 3 times, perhaps best to add docs over at the call site instead? just an idea though.

There's also a pretokenized.tokenize_with_limit call that might benefit from this too in the codebase.

Finally, would be good to test on other platforms and archs.

McPatate · 2026-05-28T08:18:57Z

+    pretokenized: &mut tk::tokenizer::PreTokenizedString,
+  ) -> tk::Result<()> {
+    let model = self.model.as_ref().ok_or("Uninitialized Model")?;
+    let guard = model.read().unwrap();


Would be nice to return an error rather than panic imo

McPatate reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bindings: acquire model read-lock once per call instead of per pre-token#2072

Bindings: acquire model read-lock once per call instead of per pre-token#2072
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p3

sebpop commented May 26, 2026 •

edited

Loading

Uh oh!

HuggingFaceDocBuilderDev commented May 28, 2026

Uh oh!

McPatate left a comment

Uh oh!

McPatate May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sebpop commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented May 28, 2026

Uh oh!

McPatate left a comment

Choose a reason for hiding this comment

Uh oh!

McPatate May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sebpop commented May 26, 2026 •

edited

Loading