Questions about model scaling #8
Open
Description
This paper is excellent! I have a couple of questions:
- Is the d_token mentioned in the paper the same as hidden_size?
- Section 3.3 of the paper says that the new parameters have the same dimension d as the old parameters, and that the numbers of new and old learnable parameter tokens are m and n, respectively, so expanding the model only requires concatenating them into a matrix of size (m+n)×d. In the code, hidden_size is used to set the dimension d. However, hidden_size differs between 150M_eval.yml and 450M_eval.yml. This suggests that, when expanding from the smaller model to the larger one, the parameters cannot simply be concatenated along the token-count dimension as described in Section 3.3, which seems inconsistent with the paper. Am I misunderstanding something? Could you please explain how exactly the model scaling is implemented?
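
To make the second question concrete, here is a minimal sketch of how I read Section 3.3. It is not taken from the repository: the function name `expand_param_tokens`, the shapes, and the zero-valued new tokens are all hypothetical, chosen only to illustrate the concatenation and why a change in hidden_size appears to break it.

```python
import numpy as np


def expand_param_tokens(old_tokens, new_tokens):
    """Stack old (n, d) and new (m, d) parameter tokens into an (n + m, d) array.

    Hypothetical helper: my reading of Section 3.3, not the authors' code.
    """
    if old_tokens.shape[1] != new_tokens.shape[1]:
        raise ValueError(
            f"dimension mismatch: d={old_tokens.shape[1]} vs d={new_tokens.shape[1]}"
        )
    # Concatenate along the token-count axis; the feature dimension d is unchanged.
    return np.concatenate([old_tokens, new_tokens], axis=0)


# Illustrative sizes only (not the values from the .yml configs).
n, m, d = 4, 2, 8
old = np.random.default_rng(0).normal(size=(n, d))
new = np.zeros((m, d))  # placeholder values; the real initialization is not assumed here

expanded = expand_param_tokens(old, new)
print(expanded.shape)  # (6, 8)

# If d itself changes between the two models, the same concatenation is undefined.
try:
    expand_param_tokens(old, np.zeros((m, 16)))
except ValueError as err:
    print("cannot concatenate:", err)
```

With equal d the old tokens are carried over unchanged as the first n rows. With different d, as the differing hidden_size values in the two configs seem to imply, this kind of row-wise concatenation is not possible, which is the source of my confusion.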
Looking forward to your reply.