
Fine-tune ViT Models on Higher Resolution Images #985

@D0miH

Description

Hi all,
thank you so much for this awesome library!!

I am using the CLIP vision transformer models and would like to test some properties with different numbers of patches.
The problem is that only a limited number of models with different patch sizes are available. For example, there are only the ViT-B/16 and ViT-B/32 models, which have (almost) the same number of trainable parameters but different patch sizes.

Therefore, I would like to emulate different patch sizes by scaling the input images and fine-tuning the model. I know that the ViTs expect input images of size 224x224. However, in the original CLIP paper (sec. 3.2) they use higher-resolution images for fine-tuning.

So now to my question:

  1. Would it be possible to fine-tune the ViT-B/32 model on higher-resolution images to emulate an arbitrarily smaller effective patch size?
  2. How can I interpolate the positional embeddings? I have found a function for that here within this repository. But as far as I can see, it is never used anywhere.
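In case it helps to make the second question concrete, here is a minimal sketch of how 2D interpolation of the positional embeddings could look. It assumes CLIP's layout of one class-token embedding followed by one embedding per patch in a square grid; the function name `interpolate_pos_embedding` and the tensor shapes are my own illustrative choices, not an API from this repository:

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embedding(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT positional embeddings to a new patch grid.

    pos_embed: shape (1 + old_grid**2, dim) -- the class-token embedding
    first, then one embedding per patch (assumed CLIP layout).
    new_grid:  side length of the target patch grid.
    """
    cls_embed, patch_embed = pos_embed[:1], pos_embed[1:]
    dim = pos_embed.shape[1]
    old_grid = int(patch_embed.shape[0] ** 0.5)

    # Reshape the per-patch embeddings to (1, dim, old_grid, old_grid)
    # so they can be treated as a 2D feature map.
    grid = patch_embed.reshape(old_grid, old_grid, dim).permute(2, 0, 1).unsqueeze(0)

    # Bicubic resampling to the new grid size.
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)

    # Back to (new_grid**2, dim) and re-attach the class-token embedding.
    grid = grid.squeeze(0).permute(1, 2, 0).reshape(new_grid * new_grid, dim)
    return torch.cat([cls_embed, grid], dim=0)


# ViT-B/32 at 224x224 has a 7x7 patch grid; a 448x448 input gives 14x14.
old = torch.randn(1 + 7 * 7, 768)
new = interpolate_pos_embedding(old, 14)
print(new.shape)  # torch.Size([197, 768])
```

The class-token embedding is deliberately left untouched, since it has no spatial position; only the patch embeddings are resampled.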

Thank you so much for your help!
