Skip to content

fix(bpe): disambiguate legacy v1 merges with spaces in vocab tokens#2078

Closed
SMC17 wants to merge 1 commit into
huggingface:mainfrom
SMC17:bpe-legacy-v1-merges-with-spaces
Closed

fix(bpe): disambiguate legacy v1 merges with spaces in vocab tokens#2078
SMC17 wants to merge 1 commit into
huggingface:mainfrom
SMC17:bpe-legacy-v1-merges-with-spaces

Conversation

@SMC17
Copy link
Copy Markdown

@SMC17 SMC17 commented May 29, 2026

BPE: disambiguate legacy v1 string-merge tokens that contain spaces

Summary

The legacy tokenizer.json v1 merges: ["a b"] format and the v2
merges: [["a", "b"]] array format are documented as equivalent
representations of the same BPE model. They are not, today.

The v1 deserialiser in tokenizers/src/models/bpe/serialization.rs
forwards each merges string into convert_merges_to_hashmap, which
performs a strict split(' ') with a parts.len() != 2 check. Any
merge whose tokens contain an internal ASCII space fails with the
message Merges text file invalid at line 1. The v2 array format
handles those tokens correctly. The result is that two tokenizer.json
files describing the same BPE model can disagree on whether they
deserialise.

This PR resolves the asymmetry on the legacy v1 path only. Single-space
lines (the overwhelmingly common case) keep their existing fast path
byte-for-byte; multi-space lines are disambiguated against the
vocabulary.

Substrate-derived rationale

This came out of porting tokenizers BPE to Zig in an independent
implementation (AGPL-3.0). The port had to accept both v1 string
merges and v2 array merges from the same vocabulary fixtures.

Specifically the existing serialisation test at
serialization.rs:194 constructs a vocabulary with the token
"b c d" and round-trips through the v2 array form. Doing the
equivalent round-trip through the v1 string form fails today with
Merges text file invalid at line 1 at line 1 column 236. The fix
here is the same disambiguation logic that the Zig port uses, ported
back to upstream Rust.

The fix is scoped to the JSON deserialiser only. The merges.txt
file-format parser (BPE::read_file -> convert_merges_to_hashmap)
is unchanged because the file format genuinely has a single ASCII
space as the wire separator between the two tokens; v1 JSON merges
share that ambiguity in principle but are recoverable in practice
because the vocabulary is co-located in the same JSON document.

Approach

For each multi-space v1 merge entry, the deserialiser tries every
ASCII-space cut position and keeps the one for which:

vocab.contains_key(a) && vocab.contains_key(b) && vocab.contains_key(a ++ b)
  • Exactly one match: that split wins.
  • Zero matches: clear error naming the offending entry and rank,
    suggesting re-serialise as v2 array merges.
  • Multiple matches: explicit ambiguity error rather than silently
    picking one.

Single-space lines (no ambiguity possible) take the existing fast
path so no tokenizer whose vocabulary contains no space-containing
tokens changes behaviour at all.

Tests

Four new tests in models::bpe::serialization::test:

  • test_legacy_v1_merges_with_space_in_token_round_trip -
    asserts that the v1 string form and the v2 array form of the same
    BPE deserialise to equal BPE values when the vocabulary contains
    a space-bearing token.
  • test_legacy_v1_merges_single_space_fast_path_unchanged -
    asserts that the common case still resolves the expected pair
    identifiers (regression guard on the fast path).
  • test_legacy_v1_merges_unresolvable_multi_space_errors_clearly -
    asserts that vocab-incompatible multi-space lines surface the new
    "could not be resolved against the vocabulary" error.
  • test_legacy_v1_merges_ambiguous_multi_space_errors_clearly -
    asserts that genuinely ambiguous lines (more than one valid
    candidate split) are refused rather than silently resolved.

One existing test updated:

  • models::tests::serialization was previously asserting against the
    cryptic Merges text file invalid at line 1 message. It is now
    asserting against the new "could not be resolved against the
    vocabulary" message. The intent of the test (invalid multi-space
    legacy line yields an error) is preserved.

CI verification

Local run on the diff:

cargo test --lib                  -> 204 passed (was 200; +4 new)
cargo test --lib models::bpe      -> 24 passed
cargo clippy --lib --no-deps      -> clean
cargo build --lib                 -> clean

The integration suite under tokenizers/tests/ requires fixture
downloads via make test (GPT-2 + BERT vocab/merges); those tests
fail without the fixtures regardless of this diff. The diff does not
touch any code that those tests exercise.

Compatibility

No public API change. Wire format unchanged. The only observable
change is:

  1. Legacy v1 tokenizer.json files containing tokens with internal
    spaces now deserialise instead of failing.
  2. Multi-space v1 merge entries that are vocab-incompatible now
    error with a more descriptive message that names the entry and
    rank, and suggests the v2 form.

The legacy `tokenizer.json` v1 `merges: ["a b"]` string format and the
v2 `merges: [["a", "b"]]` array format are documented as equivalent,
but the v1 deserialiser used a strict `split(' ')` with a
`parts.len() != 2` check that rejected any merge whose tokens
contained an internal space. The v2 array form has no such issue, so
two `tokenizer.json` files describing the same BPE model failed to
deserialise interchangeably and the v1 form surfaced a cryptic
"Merges text file invalid at line 1" message.

This commit resolves the asymmetry by disambiguating multi-space v1
merge entries against the vocabulary at deserialise time. For each
candidate split position the deserialiser checks the triple
`(a, b, a ++ b)` against the vocabulary; if exactly one split
position satisfies all three, that split wins. Zero or multiple
matches surface explicit errors that name the offending entry and
suggest re-serialising as v2 array merges. Single-space lines, the
overwhelming common case, take an unchanged fast path so no
existing tokenizer changes behaviour.

The fix is reachable only through the legacy v1 JSON path; the
`merges.txt` file parser and the v2 array path are untouched.

Surfaced while porting tokenizers BPE to Zig at tokenizers-zig
(independent implementation, AGPL-3.0), where both v1 and v2 merge
formats had to be accepted from the same vocabulary fixtures. The
same disambiguation logic was needed on the Zig side.

Tests:
- test_legacy_v1_merges_with_space_in_token_round_trip: v1 and v2
  forms of the same model deserialise to equal BPEs.
- test_legacy_v1_merges_single_space_fast_path_unchanged: the
  common case retains its existing behaviour byte-for-byte.
- test_legacy_v1_merges_unresolvable_multi_space_errors_clearly:
  vocabulary-incompatible multi-space lines surface the new
  resolution error.
- test_legacy_v1_merges_ambiguous_multi_space_errors_clearly:
  genuinely ambiguous lines are refused rather than silently
  resolved.
- Existing models::tests::serialization updated to assert against
  the new, more descriptive error message.

cargo test --lib: 204 passed (was 200; +4 new).
cargo clippy --lib --no-deps: clean.

Signed-off-by: Sean Collins <sean@sunlitmoon.online>
Assisted-by: claude-opus-4-7
@SMC17
Copy link
Copy Markdown
Author

SMC17 commented May 29, 2026

Sorry — been testing some things and this leaked out. Closing.

@SMC17 SMC17 closed this May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant