BUG: bpb underestimated when tokenizer does not contain the U+2581 (i.e. the space-marker) token #897

@riccardoalberghi

Description

build_sentencepiece_luts (train_gpt.py:180) tries to estimate how many UTF-8 bytes each token corresponds to.

For normal tokens, it removes the leading ▁, treats that as a single space byte, and then counts the UTF-8 bytes in the rest of the token string. For byte-fallback tokens (sp.is_byte()), it just counts 1 byte per token.
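The counting rule described above can be sketched roughly as follows (the function name and signature are illustrative, not the actual `build_sentencepiece_luts` code):

```python
# Hedged sketch of the per-token byte-count rule described above.
# `token_byte_count` is a hypothetical helper, not from train_gpt.py.
def token_byte_count(piece: str, is_byte_fallback: bool) -> int:
    if is_byte_fallback:
        # byte-fallback tokens like <0xE2> represent exactly one raw byte
        return 1
    if piece.startswith("\u2581"):
        # a leading "▁" stands in for a single ASCII space (1 byte),
        # plus the UTF-8 bytes of the rest of the piece
        return 1 + len(piece[1:].encode("utf-8"))
    return len(piece.encode("utf-8"))

print(token_byte_count("\u2581hello", False))  # 6: 1 space byte + 5 for "hello"
print(token_byte_count("<0xE2>", True))        # 1
```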

The issue shows up when ▁ (U+2581) is not present as its own token. In that case, SentencePiece encodes it as three separate byte tokens: <0xE2>, <0x96>, and <0x81>, since ▁ is three bytes in UTF-8. Because each of those fallback tokens is counted as 1 byte, the code ends up treating ▁ as 3 bytes.

But in practice, ▁ is only used to represent a single ASCII space (0x20), which is just 1 byte. So every word boundary gets overcounted by 2 bytes.
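The 3-vs-1 discrepancy is easy to verify directly:

```python
# "▁" (U+2581) is three bytes in UTF-8, so without a standalone token
# it falls back to three byte tokens, each counted as 1 byte...
marker_bytes = "\u2581".encode("utf-8")
print(marker_bytes.hex())        # e29681 -> <0xE2>, <0x96>, <0x81>
print(len(marker_bytes))         # 3 bytes attributed per word boundary

# ...but the marker only stands in for a single ASCII space (0x20)
print(len(" ".encode("utf-8")))  # 1 byte, the true cost
```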

That makes val_byte_count too large, which in turn makes tokens_per_byte look smaller than it really is and pushes BPB artificially low. In a few tests I ran, the reported bpb was roughly 20% lower than the true value.
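A quick back-of-the-envelope sketch of the direction of the error (all numbers below are hypothetical, chosen only to illustrate; the actual magnitude depends on the corpus and tokenizer):

```python
import math

# Hypothetical validation-set statistics, for illustration only.
total_loss_nats = 2.0e6
true_bytes = 1.0e6    # actual UTF-8 bytes in the val set
word_boundaries = 1.0e5

# Each word boundary is overcounted by 2 bytes (3 counted vs 1 real).
inflated_bytes = true_bytes + 2 * word_boundaries

bpb_true = total_loss_nats / math.log(2) / true_bytes
bpb_reported = total_loss_nats / math.log(2) / inflated_bytes

print(bpb_reported < bpb_true)       # True: inflating the byte count lowers bpb
print(1 - bpb_reported / bpb_true)   # relative understatement, ~0.17 here
```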

The standard fineweb_1024_bpe.model does include ▁ as token 939, so the normal stripping logic handles it correctly and this edge case never happens. It only appears with a custom SentencePiece model that does not contain a standalone ▁ token.

Hope this helps.
Riccardo
