Questions about model scaling

This paper is excellent! I have a couple of questions:
1. Is the d_token mentioned in the paper the same as hidden_size?
2. Section 3.3 of the paper states that the new parameter tokens have the same dimension d as the old ones, and that the numbers of new and old learnable parameter tokens are m and n, respectively. To expand the model, it is only necessary to concatenate them into a parameter set of size (m+n)×d. In the [code](https://github.com/Haiyang-W/TokenFormer/blob/128f2fe308d79353052c7ca4972dbbdfd89c24dd/megatron/model/tokenformer.py#L212), hidden_size is used to set the dimension d. However, hidden_size differs between [150M_eval.yml](https://github.com/Haiyang-W/TokenFormer/blob/main/configs/tokenformer/150M_eval.yml) and [450M_eval.yml](https://github.com/Haiyang-W/TokenFormer/blob/main/configs/tokenformer/450M_eval.yml). This suggests that when expanding from the smaller model to the larger one, the parameters cannot simply be concatenated along the token-count dimension as described in Section 3.3, which seems inconsistent with the paper. Am I misunderstanding something? Could you please explain how exactly the model scaling is implemented?
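
To make the second question concrete, here is a minimal NumPy sketch of how I understand the concatenation in Section 3.3. It is not taken from the TokenFormer repository: the function name `expand_param_tokens`, the sizes, and the zero-filled new tokens are all my own assumptions for illustration.

```python
import numpy as np


def expand_param_tokens(old_tokens, new_tokens):
    """Concatenate parameter tokens along the token axis: (n, d) + (m, d) -> (n + m, d).

    Hypothetical helper for illustration only; not the repository's API.
    """
    if old_tokens.shape[1] != new_tokens.shape[1]:
        raise ValueError(
            f"token dimension mismatch: {old_tokens.shape[1]} vs {new_tokens.shape[1]}"
        )
    return np.concatenate([old_tokens, new_tokens], axis=0)


# Illustrative sizes only (assumed, not from the configs).
n, m, d = 4, 2, 8
rng = np.random.default_rng(0)

old_keys = rng.standard_normal((n, d))  # n existing parameter tokens of dimension d
new_keys = np.zeros((m, d))             # m new tokens, same d (placeholder values)

expanded = expand_param_tokens(old_keys, new_keys)
print(expanded.shape)  # (6, 8)

# If the larger model uses a different hidden_size, the same step fails,
# which is the situation my question is about.
wider_keys = np.zeros((m, 2 * d))
try:
    expand_param_tokens(old_keys, wider_keys)
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("cannot concatenate:", err)
```

With the same d, the old tokens are kept unchanged as the first n rows of the expanded set. With a different d, there is no way to stack the two sets along the token axis, so I assume some other step is involved when hidden_size changes.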

Looking forward to your reply.

Questions about model scaling #8
