How can we get better performance from SigLIP with full from-scratch training? #1128
Replies: 2 comments
@yutojubako I've only ever trained at cc12m scale from scratch using the SigLIP loss. It worked, and appeared close to, maybe slightly better than, the CLIP contrastive (InfoNCE) loss. I don't currently have the resources to train at 5B-sample scale, so I cannot try that myself. I'd hope anyone with those resources would put some effort into optimizing/improving the impl here, as it'd be useful.

@JeniaJitsev and some collaborators did spend some time trying the SigLIP loss. It was pointed out to me that it didn't scale as well with world size as the contrastive loss with the local_loss mode we added here. In response I tried implementing some alternate approaches to the loss: there's a bidirectional exchange, a shift exchange, and a reduce-and-gather option. Gather is closest to the contrastive loss in terms of the collectives used, but it still involves an iteration that's not in the contrastive path, which likely adds op latency.

The original paper used some sort of JAX impl; I don't think the actual implementation used at scale for training was ever released. JAX has a different implementation of collectives and is likely compiling the whole model + loss + optimizer step together, which could reduce the overhead. I recommended that someone try to torch.compile the combined model + loss application; not sure if that was done.

It's also worth pointing out that the original SigLIP models were trained on a dataset nobody has access to. The behaviour of the different loss functions on this private dataset could differ from their behaviour on other datasets, especially if there is a difference in noise, quality, duplicates, or other characteristics.
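For reference, the core pairwise sigmoid loss is simple to write down in the single-process case; the scaling difficulty discussed above is entirely about how negatives are exchanged across ranks. Below is a minimal PyTorch sketch of the loss as formulated in the SigLIP paper, not the optimized implementation in this repo; the function name is illustrative, and the scale/bias init values (10 and -10) are the paper's suggested starting points.

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(image_feats, text_feats, logit_scale, logit_bias):
    """Pairwise sigmoid loss: each (image, text) pair is an independent
    binary classification -- positive on the diagonal, negative elsewhere."""
    # Normalize embeddings so logits are scaled cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t() + logit_bias
    n = logits.size(0)
    # +1 for matched pairs (the diagonal), -1 for every mismatched pair.
    labels = 2 * torch.eye(n, device=logits.device) - 1
    # -log sigmoid(label * logit), summed over all pairs, averaged over batch.
    return -F.logsigmoid(labels * logits).sum() / n

# Usage (scale 10 and bias -10 follow the paper's suggested initialization).
imgs, txts = torch.randn(8, 16), torch.randn(8, 16)
loss = sigmoid_loss(imgs, txts, logit_scale=10.0, logit_bias=-10.0)
```

Unlike InfoNCE, there is no batch-wide softmax, so in a distributed setting each rank can score its local images against chunks of text features received from other ranks and accumulate the loss, which is what the bidir/shift exchange variants do.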
@rwightman Are there any shared details regarding the total number of seen samples when running the cc12m from-scratch experiments? I'd like to check whether it works to some extent on cc12m, so please let me know if there are recommended settings within the community. (In particular, I feel that
Hello,
Is it not possible to reproduce SigLIP by training it from scratch with this codebase?
Even in an environment with many H100 GPUs, we encounter issues such as:
Is there any way to train other than by reducing the batch size? We are interested in training from scratch on a 5B-scale dataset, rather than fine-tuning.
Could you provide advice on training SigLIP from scratch?
Thank you.