When I train a tokenizer from scratch on my own dataset using the VP2-16384.config, the codebook usage is much lower than the 100% reported in the XQ-GAN paper. Low codebook usage typically results in suboptimal tokenizer reconstruction and negatively impacts downstream tasks. Do you have any suggestions on how I could address this issue?
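For context, here is how I measure codebook usage: the fraction of codebook entries selected at least once across the encoded token indices of a validation set. This is a minimal sketch of my measurement, assuming the indices are collected into a NumPy array of code IDs; `codebook_usage` is my own helper, not part of the XQ-GAN codebase.

```python
import numpy as np

def codebook_usage(indices: np.ndarray, codebook_size: int) -> float:
    """Fraction of codebook entries hit at least once by `indices`."""
    used = np.unique(indices)  # distinct code IDs actually selected
    return used.size / codebook_size

# Toy example: only 3 of 8 codes ever appear -> 37.5% usage
idx = np.array([0, 1, 1, 2, 2, 2])
print(codebook_usage(idx, 8))  # 0.375
```

On my dataset this number stays well below 1.0 even late in training.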