Replies: 1 comment
Hi there, that's a good and valid point. I'll move this to the discussions in case others have a similar question. In chapter 6, which covers a basic version of sequence classification with an autoregressive LLM, we use the final position as a summary token. I.e., with causal attention, that last position can attend to all earlier positions, so the model can learn to summarize the information needed for classification there. (If you are familiar with BERT, it behaves a bit like an implicit [CLS] token.) The wording in the chapter was probably not ideal and caused this confusion: it is not literally "the last real token representation" but rather the hidden state at the last slot of the padded sequence. And yes, the bonus materials cover more sophisticated alternatives, e.g., https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments.
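Here is a minimal sketch of what this last-token pooling looks like in PyTorch (the shapes, variable names, and `classifier_head` are illustrative, not the book's exact code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes for illustration only
batch_size, seq_len, emb_dim, num_classes = 2, 8, 16, 2

# Stand-in for the transformer output; in the book this would come from the GPT model
hidden_states = torch.randn(batch_size, seq_len, emb_dim)

# Classification head that replaces the original language-modeling output layer
classifier_head = nn.Linear(emb_dim, num_classes)

# Use the hidden state at the final position as the "summary" token:
# with causal attention, it is the only position that has attended to every earlier token.
last_hidden = hidden_states[:, -1, :]   # (batch_size, emb_dim)
logits = classifier_head(last_hidden)   # (batch_size, num_classes)
print(logits.shape)                     # torch.Size([2, 2])
```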
I found a technical issue in the book regarding fine-tuning an LLM for sequence classification. The text suggests using hidden_states[:, -1, :] to extract the last token's representation.
However, with right-padding, the vector at index [-1] usually corresponds to a [PAD] token. Shouldn't it instead be the last valid (non-padded) token, selected dynamically via the attention_mask?
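For reference, a small sketch of the attention-mask-based selection I have in mind (tensor names and values are made up for illustration):

```python
import torch

# Hypothetical example: pick the hidden state of the last *non-padded* token
# per sequence instead of indexing position -1.
batch_size, seq_len, emb_dim = 2, 6, 4
hidden_states = torch.randn(batch_size, seq_len, emb_dim)

# 1 = real token, 0 = [PAD]; right-padded sequences of lengths 4 and 6
attention_mask = torch.tensor([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
])

# Index of the last real token in each sequence
last_idx = attention_mask.sum(dim=1) - 1                          # (batch_size,)
last_hidden = hidden_states[torch.arange(batch_size), last_idx]   # (batch_size, emb_dim)
print(last_hidden.shape)   # torch.Size([2, 4])
```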