
Fine-tune ViT Models on Higher Resolution Images #985

@D0miH

Description

Hi all,
thank you so much for this awesome library!!

I am using the CLIP vision transformer models and would like to test some properties with different numbers of patches.
The problem is that only a limited number of models with different patch sizes are available. For example, there are only the ViT-B/16 and ViT-B/32 models, which have (almost) the same number of trainable parameters but different patch sizes.

Therefore, I would like to emulate different patch sizes by scaling the input images and fine-tuning the model. I know that the ViTs expect input images of size 224x224. However, in the original CLIP paper (sec. 3.2) they use higher-resolution images for fine-tuning.

So now to my question:

  1. Would it be possible to fine-tune the ViT-B/32 model on higher-resolution images to emulate an arbitrarily smaller effective patch size?
  2. How can I interpolate the positional embeddings? I have found a function for that here within this repository. But as far as I can see, it is never used anywhere.
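In case it helps to make the second question concrete, here is a minimal sketch of how 2D interpolation of the positional embeddings could look. It assumes CLIP's layout of one class-token embedding followed by one embedding per patch in a square grid; the function name `interpolate_pos_embedding` and the tensor shapes are my own illustrative choices, not an API from this repository:

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embedding(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT positional embeddings to a new patch grid.

    pos_embed: shape (1 + old_grid**2, dim) -- the class-token embedding
    first, then one embedding per patch (assumed CLIP layout).
    new_grid:  side length of the target patch grid.
    """
    cls_embed, patch_embed = pos_embed[:1], pos_embed[1:]
    dim = pos_embed.shape[1]
    old_grid = int(patch_embed.shape[0] ** 0.5)

    # Reshape the per-patch embeddings to (1, dim, old_grid, old_grid)
    # so they can be treated as a 2D feature map.
    grid = patch_embed.reshape(old_grid, old_grid, dim).permute(2, 0, 1).unsqueeze(0)

    # Bicubic resampling to the new grid size.
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)

    # Back to (new_grid**2, dim) and re-attach the class-token embedding.
    grid = grid.squeeze(0).permute(1, 2, 0).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_embed, grid], dim=0)


# ViT-B/32 at 224x224 has a 7x7 patch grid; a 448x448 input gives 14x14.
old = torch.randn(1 + 7 * 7, 768)
new = interpolate_pos_embedding(old, 14)
print(new.shape)  # torch.Size([197, 768])
```

The class-token embedding is deliberately left untouched, since it has no spatial position; only the patch embeddings are resampled.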

Thank you so much for your help!
