fix(bpe): disambiguate legacy v1 merges with spaces in vocab tokens by SMC17 · Pull Request #2078 · huggingface/tokenizers

SMC17 · 2026-05-29T00:20:55Z

BPE: disambiguate legacy v1 string-merge tokens that contain spaces

Summary

The legacy tokenizer.json v1 merges: ["a b"] format and the v2
merges: [["a", "b"]] array format are documented as equivalent
representations of the same BPE model. They are not, today.

The v1 deserialiser in tokenizers/src/models/bpe/serialization.rs
forwards each merges string into convert_merges_to_hashmap, which
performs a strict split(' ') with a parts.len() != 2 check. Any
merge whose tokens contain an internal ASCII space fails with the
message Merges text file invalid at line 1. The v2 array format
handles those tokens correctly. The result is that two tokenizer.json
files describing the same BPE model can disagree on whether they
deserialise.

This PR resolves the asymmetry on the legacy v1 path only. Single-space
lines (the overwhelmingly common case) keep their existing fast path
byte-for-byte; multi-space lines are disambiguated against the
vocabulary.

Substrate-derived rationale

This came out of porting tokenizers BPE to Zig in an independent
implementation (AGPL-3.0). The port had to accept both v1 string
merges and v2 array merges from the same vocabulary fixtures.

Specifically the existing serialisation test at
serialization.rs:194 constructs a vocabulary with the token
"b c d" and round-trips through the v2 array form. Doing the
equivalent round-trip through the v1 string form fails today with
Merges text file invalid at line 1 at line 1 column 236. The fix
here is the same disambiguation logic that the Zig port uses, ported
back to upstream Rust.

The fix is scoped to the JSON deserialiser only. The merges.txt
file-format parser (BPE::read_file -> convert_merges_to_hashmap)
is unchanged because the file format genuinely has a single ASCII
space as the wire separator between the two tokens; v1 JSON merges
share that ambiguity in principle but are recoverable in practice
because the vocabulary is co-located in the same JSON document.

Approach

For each multi-space v1 merge entry, the deserialiser tries every
ASCII-space cut position and keeps the one for which:

vocab.contains_key(a) && vocab.contains_key(b) && vocab.contains_key(a ++ b)

Exactly one match: that split wins.
Zero matches: clear error naming the offending entry and rank,
suggesting re-serialise as v2 array merges.
Multiple matches: explicit ambiguity error rather than silently
picking one.

Single-space lines (no ambiguity possible) take the existing fast
path so no tokenizer whose vocabulary contains no space-containing
tokens changes behaviour at all.

Tests

Four new tests in models::bpe::serialization::test:

test_legacy_v1_merges_with_space_in_token_round_trip -
asserts that the v1 string form and the v2 array form of the same
BPE deserialise to equal BPE values when the vocabulary contains
a space-bearing token.
test_legacy_v1_merges_single_space_fast_path_unchanged -
asserts that the common case still resolves the expected pair
identifiers (regression guard on the fast path).
test_legacy_v1_merges_unresolvable_multi_space_errors_clearly -
asserts that vocab-incompatible multi-space lines surface the new
"could not be resolved against the vocabulary" error.
test_legacy_v1_merges_ambiguous_multi_space_errors_clearly -
asserts that genuinely ambiguous lines (more than one valid
candidate split) are refused rather than silently resolved.

One existing test updated:

models::tests::serialization was previously asserting against the
cryptic Merges text file invalid at line 1 message. It is now
asserting against the new "could not be resolved against the
vocabulary" message. The intent of the test (invalid multi-space
legacy line yields an error) is preserved.

CI verification

Local run on the diff:

cargo test --lib                  -> 204 passed (was 200; +4 new)
cargo test --lib models::bpe      -> 24 passed
cargo clippy --lib --no-deps      -> clean
cargo build --lib                 -> clean

The integration suite under tokenizers/tests/ requires fixture
downloads via make test (GPT-2 + BERT vocab/merges); those tests
fail without the fixtures regardless of this diff. The diff does not
touch any code that those tests exercise.

Compatibility

No public API change. Wire format unchanged. The only observable
change is:

Legacy v1 tokenizer.json files containing tokens with internal
spaces now deserialise instead of failing.
Multi-space v1 merge entries that are vocab-incompatible now
error with a more descriptive message that names the entry and
rank, and suggests the v2 form.

The legacy `tokenizer.json` v1 `merges: ["a b"]` string format and the v2 `merges: [["a", "b"]]` array format are documented as equivalent, but the v1 deserialiser used a strict `split(' ')` with a `parts.len() != 2` check that rejected any merge whose tokens contained an internal space. The v2 array form has no such issue, so two `tokenizer.json` files describing the same BPE model failed to deserialise interchangeably and the v1 form surfaced a cryptic "Merges text file invalid at line 1" message. This commit resolves the asymmetry by disambiguating multi-space v1 merge entries against the vocabulary at deserialise time. For each candidate split position the deserialiser checks the triple `(a, b, a ++ b)` against the vocabulary; if exactly one split position satisfies all three, that split wins. Zero or multiple matches surface explicit errors that name the offending entry and suggest re-serialising as v2 array merges. Single-space lines, the overwhelming common case, take an unchanged fast path so no existing tokenizer changes behaviour. The fix is reachable only through the legacy v1 JSON path; the `merges.txt` file parser and the v2 array path are untouched. Surfaced while porting tokenizers BPE to Zig at tokenizers-zig (independent implementation, AGPL-3.0), where both v1 and v2 merge formats had to be accepted from the same vocabulary fixtures. The same disambiguation logic was needed on the Zig side. Tests: - test_legacy_v1_merges_with_space_in_token_round_trip: v1 and v2 forms of the same model deserialise to equal BPEs. - test_legacy_v1_merges_single_space_fast_path_unchanged: the common case retains its existing behaviour byte-for-byte. - test_legacy_v1_merges_unresolvable_multi_space_errors_clearly: vocabulary-incompatible multi-space lines surface the new resolution error. - test_legacy_v1_merges_ambiguous_multi_space_errors_clearly: genuinely ambiguous lines are refused rather than silently resolved. - Existing models::tests::serialization updated to assert against the new, more descriptive error message. cargo test --lib: 204 passed (was 200; +4 new). cargo clippy --lib --no-deps: clean. Signed-off-by: Sean Collins <sean@sunlitmoon.online> Assisted-by: claude-opus-4-7

SMC17 · 2026-05-29T12:11:14Z

Sorry — been testing some things and this leaked out. Closing.

SMC17 closed this May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bpe): disambiguate legacy v1 merges with spaces in vocab tokens#2078

fix(bpe): disambiguate legacy v1 merges with spaces in vocab tokens#2078
SMC17 wants to merge 1 commit into
huggingface:mainfrom
SMC17:bpe-legacy-v1-merges-with-spaces

SMC17 commented May 29, 2026

Uh oh!

SMC17 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SMC17 commented May 29, 2026

BPE: disambiguate legacy v1 string-merge tokens that contain spaces

Summary

Substrate-derived rationale

Approach

Tests

CI verification

Compatibility

Uh oh!

SMC17 commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant