How can I extract 768-dimensional patch (local) features from CoCa for downstream tasks? Should I use the output of attn_pool (the one used for captioning) to get a (256, 768) tensor? I suspect it is meant for captioning and may not be suitable for discriminative tasks. Are there any other solutions?