Background: After training my BPE tokenizer, I check its stats:

```python
tokenizer = BPETokenizer()
print(f"len(tokenizer.vocab) = {len(tokenizer.vocab)}")
```

The `len(tokenizer.vocab)` is 5000.

Next, I set up my GPT-2 configuration, using a `vocab_size` of 4999:

```python
GPT_CONFIG_124M = {
    ...
```

Then I create the data loaders and create the model:

```python
model = GPTModel(GPT_CONFIG_124M)
print(model)
```

Next, I try to pre-train the GPT model with my trained BPE tokenizer:

```python
n_epochs = 15
train_losses, val_losses, tokens_seen, track_lrs = train_model(model, ...
```

and then the training fails with the following trace:

```
Traceback (most recent call last):
    ...
```

I re-ran the same code with the tiktoken tokenizer, only changing the configuration:

```python
GPT_CONFIG_124M = {
    ...
```

and in that case training runs without errors. If anyone can help me understand what the issue is with the trained tokenizer setup, I'd appreciate it.

Thanks for any help,
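As a side note, a vocabulary of 5000 entries covers token IDs 0 through 4999, so the embedding layer needs `vocab_size=5000`; with `vocab_size=4999`, even a correctly produced ID of 4999 has no embedding row. Here is a minimal sketch of that failure mode, assuming a standard PyTorch `nn.Embedding` input layer like the one in `GPTModel`:

```python
import torch
import torch.nn as nn

# An embedding table sized like the config above: valid rows are 0..4998.
emb = nn.Embedding(num_embeddings=4999, embedding_dim=768)

ok_ids = torch.tensor([[0, 42, 4998]])
print(emb(ok_ids).shape)  # torch.Size([1, 3, 768])

bad_ids = torch.tensor([[0, 42, 4999]])  # 4999 is out of range for 4999 rows
try:
    emb(bad_ids)
except IndexError as err:
    print("lookup failed:", err)  # on CPU: "index out of range in self"
```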
Hi there, it seems like the tokenizer may still produce token IDs that are bigger than the vocabulary size in your case. Try:

```python
try:
    logits = model(input_batch.to(device))
except Exception:
    print("input ids:", input_batch.tolist())
    raise
```

to see what's going on with the IDs.
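In case it helps, here is a sketch of that guard dropped into the `calc_loss_batch` helper mentioned below; the body follows the training-chapter version of the function, so treat the exact signature as an assumption:

```python
import torch

def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    try:
        logits = model(input_batch)
    except Exception:
        # Print the raw token IDs before re-raising, so out-of-range
        # values (>= vocab_size) are easy to spot.
        print("input ids:", input_batch.tolist())
        raise
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), target_batch.flatten()
    )
    return loss
```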
Sebastian, thank you!!!
I added the try/except code that you suggested into my `calc_loss_batch` function, and sure enough, I could see token IDs much larger than 5000.

That led me to print out the entire BPE `tokenizer.vocab` list and inspect the token IDs. They all looked great, nothing too large.

So, following the code path: if the tokenizer was producing good token IDs, the next place to look was the `create_dataloader_v1` function, and there it was, the problem:
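A likely culprit, for anyone reading along: the book's version of `create_dataloader_v1` instantiates the tiktoken GPT-2 tokenizer inside the function, so the batches get encoded with the 50,257-token GPT-2 vocabulary no matter which tokenizer was trained. A sketch of one possible fix, assuming the book's `GPTDatasetV1` dataset class, is to pass the tokenizer in explicitly:

```python
from torch.utils.data import DataLoader

from previous_chapters import GPTDatasetV1  # the book's dataset class


def create_dataloader_v1(txt, tokenizer, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    # Use the caller's tokenizer instead of hardcoding
    # tiktoken.get_encoding("gpt2") inside the function.
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    return DataLoader(
        dataset, batch_size=batch_size, shuffle=shuffle,
        drop_last=drop_last, num_workers=num_workers,
    )
```

A cheap sanity check before training then catches any remaining mismatch early:

```python
max_id = max(input_batch.max().item() for input_batch, _ in train_loader)
assert max_id < GPT_CONFIG_124M["vocab_size"], (
    f"token ID {max_id} exceeds vocab_size {GPT_CONFIG_124M['vocab_size']}"
)
```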