
Is it possible to apply attention rollout in Hiera? #43

@jefflai0412

Hello,
I am working on implementing attention rollout for the Hiera model, but I encountered challenges due to Hiera’s hierarchical pooling (q_stride), Mask Unit Attention (MU), and Unroll/Reroll operations.

In standard Vision Transformers (ViTs), attention rollout assumes that the identities of input tokens are linearly combined through the layers based on attention weights. However, Hiera differs in key ways:

  1. Hierarchical Pooling (q_stride)

    • In many layers, q_stride > 1 applies spatial pooling, reducing the number of tokens before attention.
    • This means input tokens are not directly mixed linearly through self-attention, breaking the standard assumption of attention rollout.
  2. Mask Unit Attention (MU)

    • In earlier stages, attention is confined to local mask units, meaning that some tokens never directly interact with others.
    • This contrasts with standard ViTs, where every token can eventually attend to every other token.
  3. Unroll and Reroll Transformations

    • Intermediate token representations are spatially reshaped and reordered multiple times using Unroll and Reroll.
    • This makes it difficult to track token dependencies consistently through layers.
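For reference, the standard rollout whose assumptions break down here can be written in a few lines. Below is a minimal NumPy sketch of the Abnar & Zuidema method for a plain ViT; the 0.5/0.5 residual weighting is the common convention, and the function and variable names are my own:

```python
import numpy as np

def attention_rollout(attentions, residual=True):
    """Standard attention rollout for a plain ViT.

    attentions: list of (num_tokens, num_tokens) attention matrices,
        one per layer, already averaged over heads.
    residual: if True, mix in the identity to model skip connections,
        then re-normalize rows.
    Returns a (num_tokens, num_tokens) matrix whose row i gives the
    contribution of each input token to output token i.
    """
    rollout = np.eye(attentions[0].shape[0])
    for attn in attentions:
        if residual:
            attn = 0.5 * attn + 0.5 * np.eye(attn.shape[0])
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Token identities propagate by repeated matrix multiplication:
        # this is exactly the "linear combination" assumption above.
        rollout = attn @ rollout
    return rollout
```

This only works because the token set is the same size at every layer, which is precisely what q_stride pooling violates.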

Questions:

  1. Is attention rollout applicable to Hiera given these hierarchical operations?

    • Since q_stride pools tokens, how should token contributions be propagated correctly?
    • Can attention rollout be modified to properly handle hierarchical token aggregation?
  2. How can we adapt attention rollout to respect Unroll/Reroll transformations?

    • Should we first "undo" unrolling before applying attention rollout?
    • Are there any internal functions in Hiera that can help track token mappings across hierarchy levels?
  3. Would an alternative approach like hierarchical attention flow be more appropriate?

    • Instead of naive attention propagation, should we track token aggregation across pooling layers before applying rollout?
    • Any recommendations for how to do this efficiently?
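To make question 1 concrete, here is one possible framing, offered as a sketch rather than a verified solution: treat each pooling step as a linear token-aggregation map and fold it into the rollout product, so the rollout matrix stays indexed by the original input tokens even as the layer's token count shrinks. This assumes (a) the pooling can be approximated by uniform averaging over each window (Hiera's q-pooling is a max, so this is only an approximation), and (b) the per-layer grouping of input indices can be recovered despite Unroll/Reroll reordering. All names below are hypothetical:

```python
import numpy as np

def pooling_matrix(groups, num_in):
    """Assignment matrix P of shape (num_out, num_in), with
    P[i, j] = 1/|group i| if input token j is pooled into output
    token i. `groups` is a list of index lists, one per pooled token
    (assumed recoverable from the model's Unroll/Reroll bookkeeping)."""
    P = np.zeros((len(groups), num_in))
    for i, idx in enumerate(groups):
        P[i, idx] = 1.0 / len(idx)
    return P

def hierarchical_rollout(layers, num_input_tokens):
    """layers: list of (attn, groups) pairs. attn is an (M, M) attention
    matrix averaged over heads; groups describes how that layer pooled
    its N input tokens down to M (None when the layer does not pool).
    Returns an (M_final, num_input_tokens) rollout matrix."""
    rollout = np.eye(num_input_tokens)
    for attn, groups in layers:
        n_in = rollout.shape[0]
        # Model pooling as a row-stochastic linear map, identity if none.
        mix = pooling_matrix(groups, n_in) if groups is not None else np.eye(n_in)
        attn = 0.5 * attn + 0.5 * np.eye(attn.shape[0])  # residual path
        attn = attn / attn.sum(-1, keepdims=True)
        # Pool first, then attend; rollout stays (M, num_input_tokens).
        rollout = attn @ (mix @ rollout)
    return rollout
```

Two caveats under this framing: within the mask-unit-attention stages, `attn` expressed on the full token grid is block-diagonal (one block per mask unit), which correctly encodes that tokens in different units never mix until later stages; and because max-pooling is replaced by averaging, the result is a plausibility heatmap, not an exact decomposition.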

I appreciate any insights or guidance on this.

refs:
Quantifying Attention Flow in Transformers: https://arxiv.org/pdf/2005.00928
Example of visualizing ViT attention maps: https://www.kaggle.com/code/piantic/vision-transformer-vit-visualize-attention-map
