Description
build_sentencepiece_luts (train_gpt.py:180) tries to estimate how many UTF-8 bytes each token corresponds to.
For normal tokens, it removes the leading ▁, treats that as a single space byte, and then counts the UTF-8 bytes in the rest of the token string. For byte-fallback tokens (sp.is_byte()), it just counts 1 byte per token.
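A minimal sketch of that per-token byte estimate (the function name and signature here are illustrative, not the actual train_gpt.py code):

```python
def token_byte_len(piece: str, is_byte_fallback: bool) -> int:
    """Estimate how many UTF-8 bytes of original text a token covers."""
    if is_byte_fallback:
        return 1  # byte-fallback tokens like <0xE2> count as one byte each
    if piece.startswith("\u2581"):  # leading ▁ marks a word boundary
        # strip the meta symbol and count it as a single ASCII space (0x20)
        return 1 + len(piece[1:].encode("utf-8"))
    return len(piece.encode("utf-8"))
```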
The issue shows up when ▁ (U+2581) is not present as its own token. In that case, SentencePiece encodes it as three separate byte tokens: <0xE2>, <0x96>, and <0x81>, since ▁ is three bytes in UTF-8. Because each of those fallback tokens is counted as 1 byte, the code ends up treating ▁ as 3 bytes.
But in practice, ▁ is only used to represent a single ASCII space (0x20), which is just 1 byte. So every word boundary gets overcounted by 2 bytes.
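The overcount can be reproduced without SentencePiece itself, just from the UTF-8 encoding of ▁:

```python
# ▁ (U+2581) is three bytes in UTF-8, so byte fallback splits it into
# three tokens. Each is credited 1 byte, but the character only stands
# in for a single ASCII space in the original text.
meta = "\u2581"
fallback_tokens = [f"<0x{b:02X}>" for b in meta.encode("utf-8")]
# fallback_tokens == ['<0xE2>', '<0x96>', '<0x81>']

counted_bytes = len(fallback_tokens)      # 3 bytes under the current logic
actual_bytes = len(" ".encode("utf-8"))   # 1 real byte per word boundary
```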
That makes val_byte_count too large, which makes tokens_per_byte look smaller than it really is and pushes the reported BPB artificially low. In a few tests I ran, the reported BPB came out roughly 20% lower than the true value.
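To see why an inflated byte count lowers BPB, using the standard nats-to-bits-per-byte conversion (the numbers below are purely illustrative):

```python
import math

# bpb = total_NLL_in_nats / (ln 2 * byte_count): a larger denominator
# (overcounted word boundaries) mechanically shrinks the reported BPB.
total_nll_nats = 1000.0
true_bytes = 800
inflated_bytes = 1000  # every word boundary overcounted by 2 bytes

bpb_true = total_nll_nats / (math.log(2) * true_bytes)
bpb_reported = total_nll_nats / (math.log(2) * inflated_bytes)
assert bpb_reported < bpb_true  # reported BPB is artificially low
```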
The standard fineweb_1024_bpe.model does include ▁ as token 939, so the normal stripping logic handles it correctly and this edge case never happens. It only appears with a custom SentencePiece model that does not contain a standalone ▁ token.
Hope this helps.
Riccardo