
Is it possible to apply attention rollout in Hiera? #43

@jefflai0412

Hello,
I am working on implementing attention rollout for the Hiera model, but I encountered challenges due to Hiera’s hierarchical pooling (q_stride), Mask Unit Attention (MU), and Unroll/Reroll operations.

In standard Vision Transformers (ViTs), attention rollout assumes that the identities of input tokens are linearly combined through the layers based on attention weights. However, Hiera differs in key ways:

  1. Hierarchical Pooling (q_stride)

    • In many layers, q_stride > 1 applies spatial pooling, reducing the number of tokens before attention.
    • This means input tokens are not directly mixed linearly through self-attention, breaking the standard assumption of attention rollout.
  2. Mask Unit Attention (MU)

    • In earlier stages, attention is confined to local mask units, meaning that some tokens never directly interact with others.
    • This contrasts with standard ViTs, where every token can eventually attend to every other token.
  3. Unroll and Reroll Transformations

    • Intermediate token representations are spatially reshaped and reordered multiple times using Unroll and Reroll.
    • This makes it difficult to track token dependencies consistently through layers.
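For reference, the standard rollout whose assumptions break down here can be written in a few lines. Below is a minimal NumPy sketch of the Abnar & Zuidema method for a plain ViT; the 0.5/0.5 residual weighting is the common convention, and the function and variable names are my own:

```python
import numpy as np

def attention_rollout(attentions, residual=True):
    """Standard attention rollout for a plain ViT.

    attentions: list of (num_tokens, num_tokens) attention matrices,
        one per layer, already averaged over heads.
    residual: if True, mix in the identity to model skip connections,
        then re-normalize rows.
    Returns a (num_tokens, num_tokens) matrix whose row i gives the
    contribution of each input token to output token i.
    """
    rollout = np.eye(attentions[0].shape[0])
    for attn in attentions:
        if residual:
            attn = 0.5 * attn + 0.5 * np.eye(attn.shape[0])
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Token identities propagate by repeated matrix multiplication:
        # this is exactly the "linear combination" assumption above.
        rollout = attn @ rollout
    return rollout
```

This only works because the token set is the same size at every layer, which is precisely what q_stride pooling violates.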

Questions:

  1. Is attention rollout applicable to Hiera given these hierarchical operations?

    • Since q_stride pools tokens, how should token contributions be propagated correctly?
    • Can attention rollout be modified to properly handle hierarchical token aggregation?
  2. How can we adapt attention rollout to respect Unroll/Reroll transformations?

    • Should we first "undo" unrolling before applying attention rollout?
    • Are there any internal functions in Hiera that can help track token mappings across hierarchy levels?
  3. Would an alternative approach like hierarchical attention flow be more appropriate?

    • Instead of naive attention propagation, should we track token aggregation across pooling layers before applying rollout?
    • Any recommendations for how to do this efficiently?
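To make question 1 concrete, here is one possible framing, offered as a sketch rather than a verified solution: treat each pooling step as a linear token-aggregation map and fold it into the rollout product, so the rollout matrix stays indexed by the original input tokens even as the layer's token count shrinks. This assumes (a) the pooling can be approximated by uniform averaging over each window (Hiera's q-pooling is a max, so this is only an approximation), and (b) the per-layer grouping of input indices can be recovered despite Unroll/Reroll reordering. All names below are hypothetical:

```python
import numpy as np

def pooling_matrix(groups, num_in):
    """Assignment matrix P of shape (num_out, num_in), with
    P[i, j] = 1/|group i| if input token j is pooled into output
    token i. `groups` is a list of index lists, one per pooled token
    (assumed recoverable from the model's Unroll/Reroll bookkeeping)."""
    P = np.zeros((len(groups), num_in))
    for i, idx in enumerate(groups):
        P[i, idx] = 1.0 / len(idx)
    return P

def hierarchical_rollout(layers, num_input_tokens):
    """layers: list of (attn, groups) pairs. attn is an (M, M) attention
    matrix averaged over heads; groups describes how that layer pooled
    its N input tokens down to M (None when the layer does not pool).
    Returns an (M_final, num_input_tokens) rollout matrix."""
    rollout = np.eye(num_input_tokens)
    for attn, groups in layers:
        n_in = rollout.shape[0]
        # Model pooling as a row-stochastic linear map, identity if none.
        mix = pooling_matrix(groups, n_in) if groups is not None else np.eye(n_in)
        attn = 0.5 * attn + 0.5 * np.eye(attn.shape[0])  # residual path
        attn = attn / attn.sum(-1, keepdims=True)
        # Pool first, then attend; rollout stays (M, num_input_tokens).
        rollout = attn @ (mix @ rollout)
    return rollout
```

Two caveats under this framing: within the mask-unit-attention stages, `attn` expressed on the full token grid is block-diagonal (one block per mask unit), which correctly encodes that tokens in different units never mix until later stages; and because max-pooling is replaced by averaging, the result is a plausibility heatmap, not an exact decomposition.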

I appreciate any insights or guidance on this.

refs:
Quantifying Attention Flow in Transformers: https://arxiv.org/pdf/2005.00928
Example of visualizing ViT attention maps: https://www.kaggle.com/code/piantic/vision-transformer-vit-visualize-attention-map
