BUG: bpb underestimated when tokenizer does not contain the U+2581 (i.e. the space-marker) token #897

@riccardoalberghi

Description

build_sentencepiece_luts (train_gpt.py:180) tries to estimate how many UTF-8 bytes each token corresponds to.

For normal tokens, it removes the leading ▁, treats that as a single space byte, and then counts the UTF-8 bytes in the rest of the token string. For byte-fallback tokens (sp.is_byte()), it just counts 1 byte per token.
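The counting rule described above can be sketched roughly as follows (the function name and signature are illustrative, not the actual `build_sentencepiece_luts` code):

```python
# Hedged sketch of the per-token byte-count rule described above.
# `token_byte_count` is a hypothetical helper, not from train_gpt.py.
def token_byte_count(piece: str, is_byte_fallback: bool) -> int:
    if is_byte_fallback:
        # byte-fallback tokens like <0xE2> represent exactly one raw byte
        return 1
    if piece.startswith("\u2581"):
        # a leading "▁" stands in for a single ASCII space (1 byte),
        # plus the UTF-8 bytes of the rest of the piece
        return 1 + len(piece[1:].encode("utf-8"))
    return len(piece.encode("utf-8"))

print(token_byte_count("\u2581hello", False))  # 6: 1 space byte + 5 for "hello"
print(token_byte_count("<0xE2>", True))        # 1
```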

The issue shows up when ▁ (U+2581) is not present as its own token. In that case, SentencePiece encodes it as three separate byte tokens: <0xE2>, <0x96>, and <0x81>, since ▁ is three bytes in UTF-8. Because each of those fallback tokens is counted as 1 byte, the code ends up treating ▁ as 3 bytes.

But in practice, ▁ is only used to represent a single ASCII space (0x20), which is just 1 byte. So every word boundary gets overcounted by 2 bytes.
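The 3-vs-1 discrepancy is easy to verify directly:

```python
# "▁" (U+2581) is three bytes in UTF-8, so without a standalone token
# it falls back to three byte tokens, each counted as 1 byte...
marker_bytes = "\u2581".encode("utf-8")
print(marker_bytes.hex())        # e29681 -> <0xE2>, <0x96>, <0x81>
print(len(marker_bytes))         # 3 bytes attributed per word boundary

# ...but the marker only stands in for a single ASCII space (0x20)
print(len(" ".encode("utf-8")))  # 1 byte, the true cost
```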

That makes val_byte_count too large, which in turn makes tokens_per_byte look smaller than it really is and pushes BPB artificially low. In a few tests I ran, the reported bpb was roughly 20% lower than the true value.
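A quick back-of-the-envelope sketch of the direction of the error (all numbers below are hypothetical, chosen only to illustrate; the actual magnitude depends on the corpus and tokenizer):

```python
import math

# Hypothetical validation-set statistics, for illustration only.
total_loss_nats = 2.0e6
true_bytes = 1.0e6    # actual UTF-8 bytes in the val set
word_boundaries = 1.0e5

# Each word boundary is overcounted by 2 bytes (3 counted vs 1 real).
inflated_bytes = true_bytes + 2 * word_boundaries

bpb_true = total_loss_nats / math.log(2) / true_bytes
bpb_reported = total_loss_nats / math.log(2) / inflated_bytes

print(bpb_reported < bpb_true)       # True: inflating the byte count lowers bpb
print(1 - bpb_reported / bpb_true)   # relative understatement, ~0.17 here
```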

The standard fineweb_1024_bpe.model does include ▁ as token 939, so the normal stripping logic handles it correctly and this edge case never happens. It only appears with a custom SentencePiece model that does not contain a standalone ▁ token.

Hope this helps.
Riccardo
