Skip to content

Bindings: acquire model read-lock once per call instead of per pre-token#2072

Open
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p3
Open

Bindings: acquire model read-lock once per call instead of per pre-token#2072
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p3

Conversation

@sebpop
Copy link
Copy Markdown
Contributor

@sebpop sebpop commented May 26, 2026

The Python and Node bindings both wrap the inner model in Arc<RwLock<ModelWrapper>>. Their Model::tokenize() impls each do self.model.read().unwrap().tokenize(seq), so a single read() lock acquisition per call is unavoidable -- but TokenizerImpl::do_tokenize calls tokenize() once per pre-token, and a typical ~6 KB document contains ~1 500 pre-tokens. Each RwLock::read acquisition emits a ldadd4_acq_rel on aarch64 (and a paired ldadd4_rel on the matching RwLockReadGuard drop), so a single document costs a few thousand LSE atomic operations that produce no useful work.

Add a default Model::tokenize_in_pretokenized trait method whose default body is the existing per-pre-token loop. Override it in PyModel (Python binding) and Model (Node binding) so that both bindings acquire the read lock once and tokenize every pre-token under the same guard. TokenizerImpl::do_tokenize calls the new method instead of dispatching per pre-token, which makes the optimisation transparent to all callers without changing what Model::tokenize does for any in-tree or out-of-tree implementor.

The default implementation preserves the old behaviour byte-for-byte for any Model that is not behind a RwLock, so adding the method to the public trait is non-breaking: external implementors do not need to override it.

cargo test --lib --features http
 201 passed, 0 failed.

Perf evidence on Vera (88 physical / 176 logical) with wheels built -Ctarget-feature=+lse,+rcpc (so LSE atomics are inlined and do not aggregate behind __aarch64_* outline helpers). A Python process calls tokenizer.encode_batch(docs, false) in a loop for 15 s, where docs is data/big.txt (6.5 MB) split into 999 ~6.5 KB chunks. perf record -g --call-graph fp -F 4999, RAYON_NUM_THREADS=88:

  symbol                                            before    after
  <PyModel as Model>::tokenize                      89.16%    -
  <PyModel as Model>::tokenize_in_pretokenized      -          0.02%
  <ModelWrapper as Model>::tokenize                  0.00%     0.02%
  <BPE as Model>::tokenize                           0.02%     0.13%

Before, 89 % of CPU cycles are inside PyModel::tokenize, which is the wrapper that takes an Arc<RwLock>::read per pre-token, calls the inner BPE, then drops the guard (one RwLock read-acquire + release pair, i.e. two LSE atomics per pre-token, ~3 000 atomic operations per document). After, that wrapper is replaced by a single call to tokenize_in_pretokenized which takes the lock once for the whole pre-token sequence (0.02 %); the actual BPE work moves to <ModelWrapper as Model>::tokenize / <BPE as Model>::tokenize, where it is no longer wrapped in lock acquisitions.

Throughput on the same Python script, same 15 s wall-clock window:

  threads  metric                            before    after     change
  -------  ------                            ------    ------    ------
  1T       encode_batch calls completed         10        10     flat
           effective throughput              4.3 MiB/s 4.3 MiB/s  -- (BPE-merge bound)
  88T      encode_batch calls completed         61       146     +139 %
           time per call                     246 ms    103 ms    -58 %
           effective throughput             26.4 MiB/s 63.2 MiB/s +139 %

At 1T there is no win because tokenisation time is dominated by BPE merging itself; the LSE acquire/release pair is fast in the uncontended single-thread case. At 88T, removing ~3 000 unnecessary LSE atomic operations per document removes about 58 % of the wall-clock cost of encode_batch on this aarch64 host, which matches the observation that PyModel::tokenize dominated the pre-patch profile.

The Rust-only bpe_benchmark (which builds BPE directly via BpeBuilder and never goes through PyModel) is unaffected by this change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T, both within run-to-run noise.

The Python and Node bindings both wrap the inner model in
`Arc<RwLock<ModelWrapper>>`.  Their `Model::tokenize()` impls each do
`self.model.read().unwrap().tokenize(seq)`, so a single `read()` lock
acquisition per call is unavoidable -- but `TokenizerImpl::do_tokenize`
calls `tokenize()` once per pre-token, and a typical ~6 KB document
contains ~1 500 pre-tokens.  Each `RwLock::read` acquisition emits a
`ldadd4_acq_rel` on aarch64 (and a paired `ldadd4_rel` on the
matching `RwLockReadGuard` drop), so a single document costs a few
thousand LSE atomic operations that produce no useful work.

Add a default `Model::tokenize_in_pretokenized` trait method whose
default body is the existing per-pre-token loop.  Override it in
`PyModel` (Python binding) and `Model` (Node binding) so that both
bindings acquire the read lock once and tokenize every pre-token under
the same guard.  `TokenizerImpl::do_tokenize` calls the new method
instead of dispatching per pre-token, which makes the optimisation
transparent to all callers without changing what `Model::tokenize`
does for any in-tree or out-of-tree implementor.

The default implementation preserves the old behaviour byte-for-byte
for any `Model` that is not behind a `RwLock`, so adding the method
to the public trait is non-breaking: external implementors do not
need to override it.

cargo test --lib --features http: 201 passed, 0 failed.

Perf evidence on Vera (89 physical / 178 logical) with wheels built
`-Ctarget-feature=+lse,+rcpc` (so LSE atomics are inlined and do not
aggregate behind `__aarch64_*` outline helpers).  A Python process
calls `tokenizer.encode_batch(docs, false)` in a loop for 15 s, where
`docs` is `data/big.txt` (6.5 MB) split into 999 ~6.5 KB chunks.
`perf record -g --call-graph fp -F 4999`, RAYON_NUM_THREADS=88:

  symbol                                            before    after
  <PyModel as Model>::tokenize                      89.16%    -
  <PyModel as Model>::tokenize_in_pretokenized      -          0.02%
  <ModelWrapper as Model>::tokenize                  0.00%     0.02%
  <BPE as Model>::tokenize                           0.02%     0.13%

Before, 89 % of CPU cycles are inside `PyModel::tokenize`, which is
the wrapper that takes an `Arc<RwLock>::read` per pre-token, calls
the inner BPE, then drops the guard (one `RwLock` read-acquire +
release pair, i.e. two LSE atomics per pre-token, ~3 000 atomic
operations per document).  After, that wrapper is replaced by a
single call to `tokenize_in_pretokenized` which takes the lock once
for the whole pre-token sequence (0.02 %); the actual BPE work moves
to `<ModelWrapper as Model>::tokenize` / `<BPE as Model>::tokenize`,
where it is no longer wrapped in lock acquisitions.

Throughput on the same Python script, same 15 s wall-clock window:

  threads  metric                            before    after     change
  -------  ------                            ------    ------    ------
  1T       encode_batch calls completed         10        10     flat
           effective throughput              4.3 MiB/s 4.3 MiB/s  -- (BPE-merge bound)
  88T      encode_batch calls completed         61       146     +139 %
           time per call                     246 ms    103 ms    -58 %
           effective throughput             26.4 MiB/s 63.2 MiB/s +139 %

At 1T there is no win because tokenisation time is dominated by BPE
merging itself; the LSE acquire/release pair is fast in the
uncontended single-thread case.  At 88T, removing ~3 000 unnecessary
LSE atomic operations per document removes about 58 % of the
wall-clock cost of `encode_batch` on this aarch64 host, which matches
the observation that `PyModel::tokenize` dominated the pre-patch
profile.

The Rust-only `bpe_benchmark` (which builds `BPE` directly via
`BpeBuilder` and never goes through `PyModel`) is unaffected by this
change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T,
both within run-to-run noise.
@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown
Member

@McPatate McPatate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we want to replicate the comments 3 times, perhaps best to add docs over at the call site instead? just an idea though.

There's also a pretokenized.tokenize_with_limit call that might benefit from this too in the codebase.

Finally, would be good to test on other platforms and archs.

pretokenized: &mut tk::tokenizer::PreTokenizedString,
) -> tk::Result<()> {
let model = self.model.as_ref().ok_or("Uninitialized Model")?;
let guard = model.read().unwrap();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to return an error rather than panic imo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants