Replies: 1 comment
Hi there, that's a good and valid point. I'll move this to the discussions in case others have a similar question. In chapter 6, which covers a basic version of sequence classification with an autoregressive LLM, we use the final position as a summary token. I.e., with causal attention, that last position can attend to all earlier positions, so the model can learn to summarize the information needed for classification there. (If you are familiar with BERT, it behaves a bit like an implicit [CLS] token.) The wording in the chapter was probably not ideal and caused this confusion: it is not literally "the last real token representation" but rather the hidden state at the last slot of the padded sequence. And yes, the bonus materials cover more sophisticated alternatives, e.g., https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments.
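Here is a minimal sketch of what this last-token pooling looks like in PyTorch (the shapes, variable names, and `classifier_head` are illustrative, not the book's exact code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes for illustration only
batch_size, seq_len, emb_dim, num_classes = 2, 8, 16, 2

# Stand-in for the transformer output; in the book this would come from the GPT model
hidden_states = torch.randn(batch_size, seq_len, emb_dim)

# Classification head that replaces the original language-modeling output layer
classifier_head = nn.Linear(emb_dim, num_classes)

# Use the hidden state at the final position as the "summary" token:
# with causal attention, it is the only position that has attended to every earlier token.
last_hidden = hidden_states[:, -1, :]   # (batch_size, emb_dim)
logits = classifier_head(last_hidden)   # (batch_size, num_classes)
print(logits.shape)                     # torch.Size([2, 2])
```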
I found a technical issue in the book regarding fine-tuning an LLM for sequence classification. The text suggests using hidden_states[:, -1, :] to extract the last token's representation.
However, with right-padding, the vector at index [-1] usually corresponds to a [PAD] token. Shouldn't it instead be the last valid (non-padded) token, selected dynamically via the attention_mask?
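For reference, a small sketch of the attention-mask-based selection I have in mind (tensor names and values are made up for illustration):

```python
import torch

# Hypothetical example: pick the hidden state of the last *non-padded* token
# per sequence instead of indexing position -1.
batch_size, seq_len, emb_dim = 2, 6, 4
hidden_states = torch.randn(batch_size, seq_len, emb_dim)

# 1 = real token, 0 = [PAD]; right-padded sequences of lengths 4 and 6
attention_mask = torch.tensor([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
])

# Index of the last real token in each sequence
last_idx = attention_mask.sum(dim=1) - 1                          # (batch_size,)
last_hidden = hidden_states[torch.arange(batch_size), last_idx]   # (batch_size, emb_dim)
print(last_hidden.shape)   # torch.Size([2, 4])
```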