Sebastian, thank you!!!
I added the try/except code that you suggested to my 'calc_loss_batch' function, and sure enough
I could see token IDs much larger than 5000.
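
For anyone following along, the check looked roughly like this (a minimal sketch, not necessarily Sebastian's exact code; the 'vocab_size' parameter is my addition here and stands for the model's embedding vocabulary size):

import torch

def calc_loss_batch(input_batch, target_batch, model, device, vocab_size=50257):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    try:
        logits = model(input_batch)
    except IndexError:
        # On CPU, nn.Embedding raises IndexError when a token ID is >= the
        # embedding table size, so report the offending IDs before re-raising.
        print("Max token ID in batch:", input_batch.max().item(),
              "| expected <", vocab_size)
        raise
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss
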
That led me to print out the entire BPE tokenizer vocab list and inspect the token IDs. They all looked fine, nothing too large.
So, following the code path: if the tokenizer was producing good token IDs, the next place to look was the
'create_dataloader_v1' function, and there it was, the problem:

def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128,
                         shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatase…
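
A quick way to confirm whether the dataloader is the stage that introduces out-of-range IDs is to pull one batch and compare its max token ID against the tokenizer's vocab size. A sketch, assuming 'the-verdict.txt' (a placeholder; any training text file works) is on disk:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
print(tokenizer.n_vocab)  # 50257 token IDs in the GPT-2 encoding

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(raw_text, batch_size=4,
                                  max_length=256, stride=128)
inputs, targets = next(iter(dataloader))
print("Max token ID:", inputs.max().item())  # must stay below tokenizer.n_vocab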
