fix(bpe): disambiguate legacy v1 merges with spaces in vocab tokens#2078
Closed
SMC17 wants to merge 1 commit into
Closed
fix(bpe): disambiguate legacy v1 merges with spaces in vocab tokens#2078SMC17 wants to merge 1 commit into
SMC17 wants to merge 1 commit into
Conversation
The legacy `tokenizer.json` v1 `merges: ["a b"]` string format and the
v2 `merges: [["a", "b"]]` array format are documented as equivalent,
but the v1 deserialiser used a strict `split(' ')` with a
`parts.len() != 2` check that rejected any merge whose tokens
contained an internal space. The v2 array form has no such issue, so
two `tokenizer.json` files describing the same BPE model failed to
deserialise interchangeably and the v1 form surfaced a cryptic
"Merges text file invalid at line 1" message.
This commit resolves the asymmetry by disambiguating multi-space v1
merge entries against the vocabulary at deserialise time. For each
candidate split position the deserialiser checks the triple
`(a, b, a ++ b)` against the vocabulary; if exactly one split
position satisfies all three, that split wins. Zero or multiple
matches surface explicit errors that name the offending entry and
suggest re-serialising as v2 array merges. Single-space lines, the
overwhelming common case, take an unchanged fast path so no
existing tokenizer changes behaviour.
The fix is reachable only through the legacy v1 JSON path; the
`merges.txt` file parser and the v2 array path are untouched.
Surfaced while porting tokenizers BPE to Zig at tokenizers-zig
(independent implementation, AGPL-3.0), where both v1 and v2 merge
formats had to be accepted from the same vocabulary fixtures. The
same disambiguation logic was needed on the Zig side.
Tests:
- test_legacy_v1_merges_with_space_in_token_round_trip: v1 and v2
forms of the same model deserialise to equal BPEs.
- test_legacy_v1_merges_single_space_fast_path_unchanged: the
common case retains its existing behaviour byte-for-byte.
- test_legacy_v1_merges_unresolvable_multi_space_errors_clearly:
vocabulary-incompatible multi-space lines surface the new
resolution error.
- test_legacy_v1_merges_ambiguous_multi_space_errors_clearly:
genuinely ambiguous lines are refused rather than silently
resolved.
- Existing models::tests::serialization updated to assert against
the new, more descriptive error message.
cargo test --lib: 204 passed (was 200; +4 new).
cargo clippy --lib --no-deps: clean.
Signed-off-by: Sean Collins <sean@sunlitmoon.online>
Assisted-by: claude-opus-4-7
Author
|
Sorry — been testing some things and this leaked out. Closing. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
BPE: disambiguate legacy v1 string-merge tokens that contain spaces
Summary
The legacy
tokenizer.jsonv1merges: ["a b"]format and the v2merges: [["a", "b"]]array format are documented as equivalentrepresentations of the same BPE model. They are not, today.
The v1 deserialiser in
tokenizers/src/models/bpe/serialization.rsforwards each
mergesstring intoconvert_merges_to_hashmap, whichperforms a strict
split(' ')with aparts.len() != 2check. Anymerge whose tokens contain an internal ASCII space fails with the
message
Merges text file invalid at line 1. The v2 array formathandles those tokens correctly. The result is that two
tokenizer.jsonfiles describing the same BPE model can disagree on whether they
deserialise.
This PR resolves the asymmetry on the legacy v1 path only. Single-space
lines (the overwhelmingly common case) keep their existing fast path
byte-for-byte; multi-space lines are disambiguated against the
vocabulary.
Substrate-derived rationale
This came out of porting tokenizers BPE to Zig in an independent
implementation (AGPL-3.0). The port had to accept both v1 string
merges and v2 array merges from the same vocabulary fixtures.
Specifically the existing serialisation test at
serialization.rs:194constructs a vocabulary with the token"b c d"and round-trips through the v2 array form. Doing theequivalent round-trip through the v1 string form fails today with
Merges text file invalid at line 1 at line 1 column 236. The fixhere is the same disambiguation logic that the Zig port uses, ported
back to upstream Rust.
The fix is scoped to the JSON deserialiser only. The
merges.txtfile-format parser (
BPE::read_file->convert_merges_to_hashmap)is unchanged because the file format genuinely has a single ASCII
space as the wire separator between the two tokens; v1 JSON merges
share that ambiguity in principle but are recoverable in practice
because the vocabulary is co-located in the same JSON document.
Approach
For each multi-space v1 merge entry, the deserialiser tries every
ASCII-space cut position and keeps the one for which:
suggesting re-serialise as v2 array merges.
picking one.
Single-space lines (no ambiguity possible) take the existing fast
path so no tokenizer whose vocabulary contains no space-containing
tokens changes behaviour at all.
Tests
Four new tests in
models::bpe::serialization::test:test_legacy_v1_merges_with_space_in_token_round_trip-asserts that the v1 string form and the v2 array form of the same
BPE deserialise to equal
BPEvalues when the vocabulary containsa space-bearing token.
test_legacy_v1_merges_single_space_fast_path_unchanged-asserts that the common case still resolves the expected pair
identifiers (regression guard on the fast path).
test_legacy_v1_merges_unresolvable_multi_space_errors_clearly-asserts that vocab-incompatible multi-space lines surface the new
"could not be resolved against the vocabulary" error.
test_legacy_v1_merges_ambiguous_multi_space_errors_clearly-asserts that genuinely ambiguous lines (more than one valid
candidate split) are refused rather than silently resolved.
One existing test updated:
models::tests::serializationwas previously asserting against thecryptic
Merges text file invalid at line 1message. It is nowasserting against the new "could not be resolved against the
vocabulary" message. The intent of the test (invalid multi-space
legacy line yields an error) is preserved.
CI verification
Local run on the diff:
The integration suite under
tokenizers/tests/requires fixturedownloads via
make test(GPT-2 + BERT vocab/merges); those testsfail without the fixtures regardless of this diff. The diff does not
touch any code that those tests exercise.
Compatibility
No public API change. Wire format unchanged. The only observable
change is:
tokenizer.jsonfiles containing tokens with internalspaces now deserialise instead of failing.
error with a more descriptive message that names the entry and
rank, and suggests the v2 form.