Question about multi-scale VAE training

When training the VQVAE with the VQ-4096 config, a state_dict mismatch occurs while loading the pre-trained weights. The pre-trained model is vit_base_patch14_dinov2 (which uses 14x14 patches), but our model definition might be based on 16x16 patches. Could you tell me how to handle this?
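
To make the suspected mismatch concrete, here is a minimal sketch of the shapes I would expect to differ between a 14x14-patch and a 16x16-patch ViT-Base. It is only arithmetic, not the repository's code: the helper names `vit_shapes` and `diff_shapes` are hypothetical, the 224 input size and the single class token are assumptions, and the key names follow the timm `VisionTransformer` convention (`patch_embed.proj.weight`, `pos_embed`).

```python
def vit_shapes(img_size, patch_size, embed_dim=768, in_chans=3, num_prefix_tokens=1):
    """Expected shapes of the ViT parameters that depend on the patch size.

    Hypothetical helper: embed_dim=768 is the ViT-Base width; img_size and
    num_prefix_tokens (one class token) are assumptions for illustration.
    """
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    grid = img_size // patch_size  # patches per side
    return {
        # Conv kernel that projects each patch to an embedding
        "patch_embed.proj.weight": (embed_dim, in_chans, patch_size, patch_size),
        # One position embedding per patch, plus the prefix token(s)
        "pos_embed": (1, grid * grid + num_prefix_tokens, embed_dim),
    }


def diff_shapes(checkpoint_shapes, model_shapes):
    """Return {key: (checkpoint_shape, model_shape)} for keys whose shapes differ."""
    return {
        key: (checkpoint_shapes[key], model_shapes[key])
        for key in checkpoint_shapes
        if key in model_shapes and checkpoint_shapes[key] != model_shapes[key]
    }


# Assumed 224x224 input for both; only the patch size differs.
pretrained = vit_shapes(img_size=224, patch_size=14)
ours = vit_shapes(img_size=224, patch_size=16)

mismatches = diff_shapes(pretrained, ours)
for name, (ckpt_shape, model_shape) in mismatches.items():
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

Under these assumptions both the patch projection kernel and the position embedding differ in shape, which would match the kind of state_dict mismatch I am seeing.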