Hi all,
Thank you so much for this awesome library!
I am using the vision transformer CLIP models, and I would like to test some properties with different numbers of patches.
The problem is that only a limited number of models with different patch sizes are available. For example, there are only the ViT-B/16 and ViT-B/32 models, which have roughly the same number of trainable parameters but different patch sizes.
Therefore, I would like to emulate different patch sizes by scaling the input images and fine-tuning the model. I know that the ViTs expect input images of size 224x224; however, in the original CLIP paper (sec. 3.2) they use higher-resolution images for fine-tuning.
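To make the idea concrete, here is the arithmetic I have in mind (a minimal sketch, not from the repo): feeding ViT-B/32 a 448x448 image yields a 14x14 patch grid, which, relative to the original 224x224 content, behaves like an effective patch size of 16.

```python
# Sketch: how input resolution changes the patch grid of ViT-B/32.
# Effective patch size is measured relative to the original 224x224 content.
patch_size = 32
base_res = 224

for res in (224, 448):
    grid = res // patch_size                     # patches per side at this resolution
    effective_patch = patch_size * base_res // res  # patch size relative to 224px content
    print(f"{res}x{res}: {grid}x{grid} grid, effective patch ~{effective_patch}px")
```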
So now to my question:
- Would it be possible to fine-tune the ViT-B/32 model with higher-resolution images to emulate an arbitrary smaller patch size?
- How can I interpolate the positional embeddings? I found a function for that in this repository, but as far as I can see it is never used anywhere.
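For reference, here is what I understand the interpolation would look like. This is my own minimal sketch, not code from the repo: the function name is mine, and I am assuming the positional embedding has shape `[1 + grid**2, width]` with the class-token embedding in the first row (as in CLIP's `visual.positional_embedding`).

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Bicubically resize ViT positional embeddings to a new patch grid.

    pos_embed: [1 + old_grid**2, dim], first row is the class-token embedding.
    Returns:   [1 + new_grid**2, dim].
    """
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    old_grid = int(patch_pos.shape[0] ** 0.5)
    dim = patch_pos.shape[1]

    # [N, dim] -> [1, dim, old_grid, old_grid] for 2-D interpolation
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(
        patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False
    )
    # back to [new_grid**2, dim]
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)

    # class-token embedding is passed through unchanged
    return torch.cat([cls_pos, patch_pos], dim=0)
```

For example, resizing ViT-B/32's 50-token embedding (7x7 grid + class token at 224x224) to a 14x14 grid would give 197 tokens, matching a 448x448 input.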
Thank you so much for your help!