perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1160
yingxudeng merged 2 commits into jd-opensource:main
Conversation
Code Review
This pull request introduces support for hybrid attention models (such as qwen3_next) by differentiating between full attention and linear (GDN) attention layers during KV cache estimation and allocation. Key changes:

- LLMEngine and RecEngine now calculate cache capacity based on the specific layer types present in the model.
- AclGraph gains logic to correctly identify valid KV caches in mixed-layer models.
- WorkerImpl is refactored to selectively allocate the appropriate cache tensors (conv/ssm vs. key/value) per layer.

Review feedback highlights the need for better consistency across the engine: the centralized is_full_attention_layer helper function should be used everywhere to avoid logic errors related to default attention intervals and potential division-by-zero issues when a model contains no full-attention layers.
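The capacity estimation described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the layer descriptor, tensor sizes, and the free-standing `is_full_attention_layer` signature are assumptions; only the helper's name comes from the review. The key points it demonstrates are that linear (GDN) layers reserve fixed-size state rather than per-token KV cache, and that the divisor must be guarded when no full-attention layers exist.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layer descriptor; the real code derives this from the
// model config.
enum class LayerType { kFullAttention, kLinearAttention };

// Stand-in for the centralized is_full_attention_layer helper the
// review asks the engine to reuse consistently.
bool is_full_attention_layer(const std::vector<LayerType>& layers, size_t i) {
  return layers[i] == LayerType::kFullAttention;
}

// Estimate how many tokens of KV cache fit in `budget_bytes`.
// Only full-attention layers consume per-token key/value cache; linear
// layers hold fixed-size conv/ssm state. Returns 0 (instead of
// dividing by zero) when the model has no full-attention layers.
int64_t estimate_cache_capacity(const std::vector<LayerType>& layers,
                                int64_t budget_bytes,
                                int64_t kv_bytes_per_token_per_layer,
                                int64_t linear_state_bytes_per_layer) {
  int64_t n_full = 0;
  int64_t n_linear = 0;
  for (size_t i = 0; i < layers.size(); ++i) {
    if (is_full_attention_layer(layers, i)) {
      ++n_full;
    } else {
      ++n_linear;
    }
  }
  // Linear layers reserve their fixed state budget up front.
  const int64_t remaining =
      budget_bytes - n_linear * linear_state_bytes_per_layer;
  if (n_full == 0 || remaining <= 0) {
    return 0;  // guard against division by zero / exhausted budget
  }
  return remaining / (n_full * kv_bytes_per_token_per_layer);
}
```

Centralizing the layer-type check this way is what keeps the engine, graph, and worker paths from drifting apart on which layers count toward KV capacity.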
Removed unused layer types variable from worker_impl.cpp
BTW, the KV cache initialization needs to be split out into a separate function. It's too complex right now.
Sure, this KV cache initialization will be refactored in the next PR.
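The per-layer allocation split the PR describes (key/value tensors for full-attention layers, conv/ssm state for linear layers) can be illustrated with a small sketch. The `LayerCache` struct and string tensor names here are hypothetical placeholders for the device tensors WorkerImpl actually allocates; the point is the branching on layer type rather than the allocation mechanics.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical layer descriptor; the real code derives this from the
// model config.
enum class LayerType { kFullAttention, kLinearAttention };

// Placeholder for a layer's cache tensors; WorkerImpl allocates real
// device tensors, not names.
struct LayerCache {
  std::vector<std::string> tensors;
};

// Sketch of selective per-layer allocation: full-attention layers get
// key/value caches, linear (GDN) layers get conv/ssm state instead.
std::vector<LayerCache> allocate_caches(const std::vector<LayerType>& layers) {
  std::vector<LayerCache> caches;
  caches.reserve(layers.size());
  for (LayerType t : layers) {
    LayerCache c;
    if (t == LayerType::kFullAttention) {
      c.tensors = {"key_cache", "value_cache"};
    } else {
      c.tensors = {"conv_state", "ssm_state"};
    }
    caches.push_back(std::move(c));
  }
  return caches;
}
```

Isolating this loop in its own function is essentially the refactor the comment above asks for: the branching stays in one place instead of being interleaved with the rest of cache initialization.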
