How can we get better performance from SigLIP with full from-scratch training? #1128
Replies: 2 comments
@yutojubako I've only ever trained at cc12m scale from scratch using the SigLIP loss. It worked, and appeared close to, maybe slightly better than, the CLIP contrastive (InfoNCE) loss. I don't currently have the resources to train at 5B-sample scale, so I cannot try that myself. I'd hope anyone with those resources would put some effort into optimizing/improving the impl here, as it'd be useful.

@JeniaJitsev and some collaborators did spend some time trying the SigLIP loss. It was pointed out to me that it didn't scale as well with world size as the contrastive loss with the local_loss mode we added here. In response I tried implementing some alternate approaches to the loss: there's a bidirectional exchange, a shift exchange, and a reduce-and-gather option. Gather is closest to the contrastive loss in terms of the collectives used, but it still involves an iteration that's not in the contrastive path, which likely adds op latency.

The original paper used some sort of JAX impl; I don't think the actual implementation used at scale for training was ever released. JAX has a different implementation of collectives and is likely compiling the whole model + loss + optimizer step together, which could reduce the overhead. I recommended that someone try to torch.compile the combined model + loss application; not sure if that was done.

It's also worth pointing out that the original SigLIP models were trained on a dataset nobody has access to. The behaviour of the different loss functions on this private dataset could differ from their behaviour on other datasets, especially if there is a difference in noise, quality, duplicates, or other characteristics.
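For reference, the core pairwise sigmoid loss is simple to write down in the single-process case; the scaling difficulty discussed above is entirely about how negatives are exchanged across ranks. Below is a minimal PyTorch sketch of the loss as formulated in the SigLIP paper, not the optimized implementation in this repo; the function name is illustrative, and the scale/bias init values (10 and -10) are the paper's suggested starting points.

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(image_feats, text_feats, logit_scale, logit_bias):
    """Pairwise sigmoid loss: each (image, text) pair is an independent
    binary classification -- positive on the diagonal, negative elsewhere."""
    # Normalize embeddings so logits are scaled cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t() + logit_bias
    n = logits.size(0)
    # +1 for matched pairs (the diagonal), -1 for every mismatched pair.
    labels = 2 * torch.eye(n, device=logits.device) - 1
    # -log sigmoid(label * logit), summed over all pairs, averaged over batch.
    return -F.logsigmoid(labels * logits).sum() / n

# Usage (scale 10 and bias -10 follow the paper's suggested initialization).
imgs, txts = torch.randn(8, 16), torch.randn(8, 16)
loss = sigmoid_loss(imgs, txts, logit_scale=10.0, logit_bias=-10.0)
```

Unlike InfoNCE, there is no batch-wide softmax, so in a distributed setting each rank can score its local images against chunks of text features received from other ranks and accumulate the loss, which is what the bidir/shift exchange variants do.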
@rwightman Are there any shared details regarding the total number of seen samples when running the cc12m from-scratch experiments? I'd like to check whether it works to some extent on cc12m, so please let me know if there are recommended settings within the community. (In particular, I feel that
Hello,
Is it not possible to reproduce SigLIP by training it from scratch with this codebase?
Even in an environment with many H100 GPUs, we encounter issues such as:
Is there any way to train other than by reducing the batch size? We are interested in training from scratch on a 5B-scale dataset, rather than fine-tuning.
Could you provide advice on training SigLIP from scratch?
Thank you.