Background: After training my BPE tokenizer, I check its stats:

```python
tokenizer = BPETokenizer()
print(f"len(tokenizer.vocab) = {len(tokenizer.vocab)}")
```

The `len(tokenizer.vocab)` is 5000.

Next, I set up my GPT-2 configuration, using a `vocab_size` of 4999:

```python
GPT_CONFIG_124M = {
    ...
```

Then I create the data loaders and create the model:

```python
model = GPTModel(GPT_CONFIG_124M)
print(model)
```

Next, I try to pre-train the GPT model with my trained BPE tokenizer:

```python
n_epochs = 15
train_losses, val_losses, tokens_seen, track_lrs = train_model(model, ...
```

and then the training fails with the following trace:

```
Traceback (most recent call last):
    ...
```

I re-ran the same code with the tiktoken tokenizer, only changing the configuration:

```python
GPT_CONFIG_124M = {
    ...
```

and in that case training runs without errors. If anyone can help me understand what the issue is with the trained tokenizer setup, I'd appreciate it.

Thanks for any help,
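As a side note, a vocabulary of 5000 entries covers token IDs 0 through 4999, so the embedding layer needs `vocab_size=5000`; with `vocab_size=4999`, even a correctly produced ID of 4999 has no embedding row. Here is a minimal sketch of that failure mode, assuming a standard PyTorch `nn.Embedding` input layer like the one in `GPTModel`:

```python
import torch
import torch.nn as nn

# An embedding table sized like the config above: valid rows are 0..4998.
emb = nn.Embedding(num_embeddings=4999, embedding_dim=768)

ok_ids = torch.tensor([[0, 42, 4998]])
print(emb(ok_ids).shape)  # torch.Size([1, 3, 768])

bad_ids = torch.tensor([[0, 42, 4999]])  # 4999 is out of range for 4999 rows
try:
    emb(bad_ids)
except IndexError as err:
    print("lookup failed:", err)  # on CPU: "index out of range in self"
```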
Hi there, it seems like the tokenizer may still produce token IDs that are bigger than the vocabulary size in your case. Try:

```python
try:
    logits = model(input_batch.to(device))
except Exception:
    print("input ids:", input_batch.tolist())
    raise
```

to see what's going on with the IDs.
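In case it helps, here is a sketch of that guard dropped into the `calc_loss_batch` helper mentioned below; the body follows the training-chapter version of the function, so treat the exact signature as an assumption:

```python
import torch

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    try:
        logits = model(input_batch)
    except Exception:
        # Print the raw token IDs before re-raising, so out-of-range
        # values (>= vocab_size) are easy to spot.
        print("input ids:", input_batch.tolist())
        raise
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss
```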
Sebastian, thank you!!!
I added the try/except code that you suggested into my `calc_loss_batch` function, and sure enough, I could see token IDs much larger than 5000.

That led me to print out the entire BPE `tokenizer.vocab` list and inspect the token IDs. They all looked great, nothing too large.

So, following the code path: if the tokenizer was producing good token IDs, the next place to look was the `create_dataloader_v1` function, and there it was, the problem:
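A likely culprit, for anyone reading along: the book's version of `create_dataloader_v1` instantiates the tiktoken GPT-2 tokenizer inside the function, so the batches get encoded with the 50,257-token GPT-2 vocabulary no matter which tokenizer was trained. A sketch of one possible fix, assuming the book's `GPTDatasetV1` dataset class, is to pass the tokenizer in explicitly:

```python
from torch.utils.data import DataLoader

from previous_chapters import GPTDatasetV1  # the book's dataset class


def create_dataloader_v1(txt, tokenizer, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Use the caller's tokenizer instead of hardcoding
    # tiktoken.get_encoding("gpt2") inside the function.
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    return DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle,
        drop_last=drop_last, num_workers=num_workers,
    )
```

A cheap sanity check before training then catches any remaining mismatch early:

```python
max_id = max(input_batch.max().item() for input_batch, _ in train_loader)
assert max_id < GPT_CONFIG_124M["vocab_size"], (
    f"token ID {max_id} exceeds vocab_size {GPT_CONFIG_124M['vocab_size']}"
)
```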