Bindings: acquire model read-lock once per call instead of per pre-token#2072
Open
sebpop wants to merge 1 commit into
Open
Bindings: acquire model read-lock once per call instead of per pre-token#2072sebpop wants to merge 1 commit into
sebpop wants to merge 1 commit into
Conversation
The Python and Node bindings both wrap the inner model in
`Arc<RwLock<ModelWrapper>>`. Their `Model::tokenize()` impls each do
`self.model.read().unwrap().tokenize(seq)`, so a single `read()` lock
acquisition per call is unavoidable -- but `TokenizerImpl::do_tokenize`
calls `tokenize()` once per pre-token, and a typical ~6 KB document
contains ~1 500 pre-tokens. Each `RwLock::read` acquisition emits a
`ldadd4_acq_rel` on aarch64 (and a paired `ldadd4_rel` on the
matching `RwLockReadGuard` drop), so a single document costs a few
thousand LSE atomic operations that produce no useful work.
Add a default `Model::tokenize_in_pretokenized` trait method whose
default body is the existing per-pre-token loop. Override it in
`PyModel` (Python binding) and `Model` (Node binding) so that both
bindings acquire the read lock once and tokenize every pre-token under
the same guard. `TokenizerImpl::do_tokenize` calls the new method
instead of dispatching per pre-token, which makes the optimisation
transparent to all callers without changing what `Model::tokenize`
does for any in-tree or out-of-tree implementor.
The default implementation preserves the old behaviour byte-for-byte
for any `Model` that is not behind a `RwLock`, so adding the method
to the public trait is non-breaking: external implementors do not
need to override it.
cargo test --lib --features http: 201 passed, 0 failed.
Perf evidence on Vera (89 physical / 178 logical) with wheels built
`-Ctarget-feature=+lse,+rcpc` (so LSE atomics are inlined and do not
aggregate behind `__aarch64_*` outline helpers). A Python process
calls `tokenizer.encode_batch(docs, false)` in a loop for 15 s, where
`docs` is `data/big.txt` (6.5 MB) split into 999 ~6.5 KB chunks.
`perf record -g --call-graph fp -F 4999`, RAYON_NUM_THREADS=88:
symbol before after
<PyModel as Model>::tokenize 89.16% -
<PyModel as Model>::tokenize_in_pretokenized - 0.02%
<ModelWrapper as Model>::tokenize 0.00% 0.02%
<BPE as Model>::tokenize 0.02% 0.13%
Before, 89 % of CPU cycles are inside `PyModel::tokenize`, which is
the wrapper that takes an `Arc<RwLock>::read` per pre-token, calls
the inner BPE, then drops the guard (one `RwLock` read-acquire +
release pair, i.e. two LSE atomics per pre-token, ~3 000 atomic
operations per document). After, that wrapper is replaced by a
single call to `tokenize_in_pretokenized` which takes the lock once
for the whole pre-token sequence (0.02 %); the actual BPE work moves
to `<ModelWrapper as Model>::tokenize` / `<BPE as Model>::tokenize`,
where it is no longer wrapped in lock acquisitions.
Throughput on the same Python script, same 15 s wall-clock window:
threads metric before after change
------- ------ ------ ------ ------
1T encode_batch calls completed 10 10 flat
effective throughput 4.3 MiB/s 4.3 MiB/s -- (BPE-merge bound)
88T encode_batch calls completed 61 146 +139 %
time per call 246 ms 103 ms -58 %
effective throughput 26.4 MiB/s 63.2 MiB/s +139 %
At 1T there is no win because tokenisation time is dominated by BPE
merging itself; the LSE acquire/release pair is fast in the
uncontended single-thread case. At 88T, removing ~3 000 unnecessary
LSE atomic operations per document removes about 58 % of the
wall-clock cost of `encode_batch` on this aarch64 host, which matches
the observation that `PyModel::tokenize` dominated the pre-patch
profile.
The Rust-only `bpe_benchmark` (which builds `BPE` directly via
`BpeBuilder` and never goes through `PyModel`) is unaffected by this
change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T,
both within run-to-run noise.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
McPatate
reviewed
May 28, 2026
Member
McPatate
left a comment
There was a problem hiding this comment.
Not sure we want to replicate the comments 3 times, perhaps best to add docs over at the call site instead? just an idea though.
There's also a pretokenized.tokenize_with_limit call that might benefit from this too in the codebase.
Finally, would be good to test on other platforms and archs.
| pretokenized: &mut tk::tokenizer::PreTokenizedString, | ||
| ) -> tk::Result<()> { | ||
| let model = self.model.as_ref().ok_or("Uninitialized Model")?; | ||
| let guard = model.read().unwrap(); |
Member
There was a problem hiding this comment.
Would be nice to return an error rather than panic imo
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The Python and Node bindings both wrap the inner model in
Arc<RwLock<ModelWrapper>>. TheirModel::tokenize()impls each doself.model.read().unwrap().tokenize(seq), so a singleread()lock acquisition per call is unavoidable -- butTokenizerImpl::do_tokenizecallstokenize()once per pre-token, and a typical ~6 KB document contains ~1 500 pre-tokens. EachRwLock::readacquisition emits aldadd4_acq_relon aarch64 (and a pairedldadd4_relon the matchingRwLockReadGuarddrop), so a single document costs a few thousand LSE atomic operations that produce no useful work.Add a default
Model::tokenize_in_pretokenizedtrait method whose default body is the existing per-pre-token loop. Override it inPyModel(Python binding) andModel(Node binding) so that both bindings acquire the read lock once and tokenize every pre-token under the same guard.TokenizerImpl::do_tokenizecalls the new method instead of dispatching per pre-token, which makes the optimisation transparent to all callers without changing whatModel::tokenizedoes for any in-tree or out-of-tree implementor.The default implementation preserves the old behaviour byte-for-byte for any
Modelthat is not behind aRwLock, so adding the method to the public trait is non-breaking: external implementors do not need to override it.Perf evidence on Vera (88 physical / 176 logical) with wheels built
-Ctarget-feature=+lse,+rcpc(so LSE atomics are inlined and do not aggregate behind__aarch64_*outline helpers). A Python process callstokenizer.encode_batch(docs, false)in a loop for 15 s, wheredocsisdata/big.txt(6.5 MB) split into 999 ~6.5 KB chunks.perf record -g --call-graph fp -F 4999, RAYON_NUM_THREADS=88:Before, 89 % of CPU cycles are inside
PyModel::tokenize, which is the wrapper that takes anArc<RwLock>::readper pre-token, calls the inner BPE, then drops the guard (oneRwLockread-acquire + release pair, i.e. two LSE atomics per pre-token, ~3 000 atomic operations per document). After, that wrapper is replaced by a single call totokenize_in_pretokenizedwhich takes the lock once for the whole pre-token sequence (0.02 %); the actual BPE work moves to<ModelWrapper as Model>::tokenize/<BPE as Model>::tokenize, where it is no longer wrapped in lock acquisitions.Throughput on the same Python script, same 15 s wall-clock window:
At 1T there is no win because tokenisation time is dominated by BPE merging itself; the LSE acquire/release pair is fast in the uncontended single-thread case. At 88T, removing ~3 000 unnecessary LSE atomic operations per document removes about 58 % of the wall-clock cost of
encode_batchon this aarch64 host, which matches the observation thatPyModel::tokenizedominated the pre-patch profile.The Rust-only
bpe_benchmark(which buildsBPEdirectly viaBpeBuilderand never goes throughPyModel) is unaffected by this change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T, both within run-to-run noise.