perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1160
JC-ut0 wants to merge 1 commit into jd-opensource:main from
Conversation
Code Review
This pull request introduces support for hybrid attention models (such as qwen3_next) by differentiating between full attention and linear (GDN) attention layers during KV cache estimation and allocation. Key changes include updating LLMEngine and RecEngine to calculate cache capacity based on specific layer types, adding logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactoring WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer. Review feedback highlights the need for better consistency across the engine by utilizing the centralized is_full_attention_layer helper function to avoid logic errors related to default attention intervals and potential division-by-zero issues.
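The reviews below repeatedly refer to a centralized is_full_attention_layer helper in model_args.h that is not quoted in this thread. For reference only, a minimal sketch consistent with the behavior the reviewers describe (explicit layer_types take precedence; a zero full_attention_interval falls back to a default of 4) might look like the following. The field names and the precedence order are assumptions, and plain struct fields stand in for the project's accessor methods.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch only: stand-in for the real ModelArgs, which exposes these values
// through accessor methods rather than public fields.
struct ModelArgs {
  std::vector<std::string> layer_types;  // per-layer types, may be empty
  int64_t full_attention_interval = 0;   // 0 is treated as "use default of 4"
};

inline bool is_full_attention_layer(const ModelArgs& args, int64_t layer_idx) {
  // Explicit per-layer types win when the config provides them.
  if (!args.layer_types.empty()) {
    const auto& type = args.layer_types[static_cast<size_t>(layer_idx)];
    return type == "full_attention" || type == "attention";
  }
  // Interval rule otherwise; a zero interval falls back to 4, so the modulo
  // below never divides by zero. Every interval-th layer is full attention.
  const int64_t interval =
      args.full_attention_interval > 0 ? args.full_attention_interval : 4;
  return (layer_idx + 1) % interval == 0;
}
```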
int64_t num_full_attention_layers =
    kv_cache_cap.n_layers / args_.full_attention_interval();
int64_t num_linear_attention_layers =
    kv_cache_cap.n_layers - num_full_attention_layers;
This calculation is susceptible to a division by zero error if full_attention_interval() is 0 (the default value in ModelArgs). Additionally, it does not account for configurations that use explicit layer_types. It is recommended to use the is_full_attention_layer helper function in a loop to accurately and safely count the layers, ensuring consistency with the rest of the engine.
-int64_t num_full_attention_layers =
-    kv_cache_cap.n_layers / args_.full_attention_interval();
-int64_t num_linear_attention_layers =
-    kv_cache_cap.n_layers - num_full_attention_layers;
+int64_t num_full_attention_layers = 0;
+int64_t num_linear_attention_layers = 0;
+for (int64_t i = 0; i < kv_cache_cap.n_layers; ++i) {
+  if (is_full_attention_layer(args_, i)) {
+    num_full_attention_layers++;
+  } else {
+    num_linear_attention_layers++;
+  }
+}
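For context on why the split matters: full-attention layers and linear (GDN) layers have very different cache footprints, so the two counts enter the capacity estimate separately. The following is a rough, hypothetical illustration of that idea, not the PR's actual formula; every parameter name here is an assumption.

```cpp
#include <cstdint>

// Hypothetical illustration only: full-attention layers hold per-token
// key/value entries, while linear (GDN) layers hold fixed-size conv/ssm
// state per sequence, so each layer count is weighted by a different cost.
int64_t estimate_cache_bytes(int64_t num_full_attention_layers,
                             int64_t num_linear_attention_layers,
                             int64_t kv_bytes_per_token_per_layer,
                             int64_t state_bytes_per_seq_per_layer,
                             int64_t max_cached_tokens,
                             int64_t max_sequences) {
  const int64_t full_bytes = num_full_attention_layers *
                             kv_bytes_per_token_per_layer * max_cached_tokens;
  const int64_t linear_bytes = num_linear_attention_layers *
                               state_bytes_per_seq_per_layer * max_sequences;
  return full_bytes + linear_bytes;
}
```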
int64_t num_full_attention_layers = kv_cache_cap.n_layers;
int64_t num_linear_attention_layers = 0;
if (args_.full_attention_interval() > 0) {
  num_full_attention_layers =
      kv_cache_cap.n_layers / args_.full_attention_interval();
  num_linear_attention_layers =
      kv_cache_cap.n_layers - num_full_attention_layers;
} else if (has_linear_attention_layers(args_)) {
  // Count layer types from args.layer_types()
  const auto& layer_types = args_.layer_types();
  num_full_attention_layers = 0;
  num_linear_attention_layers = 0;
  for (const auto& layer_type : layer_types) {
    if (layer_type == "full_attention" || layer_type == "attention") {
      num_full_attention_layers++;
    } else {
      num_linear_attention_layers++;
    }
  }
}
The manual counting of layers here is inconsistent with the is_full_attention_layer helper in model_args.h. Specifically, it fails to handle the default interval of 4 when full_attention_interval is 0, which the helper correctly addresses. Using the centralized helper function ensures consistency across the engine and simplifies the logic.
int64_t num_full_attention_layers = 0;
int64_t num_linear_attention_layers = 0;
for (int64_t i = 0; i < kv_cache_cap.n_layers; ++i) {
  if (is_full_attention_layer(args_, i)) {
    num_full_attention_layers++;
  } else {
    num_linear_attention_layers++;
  }
}

auto is_linear_attention_layer = [&](int64_t layer_idx) {
  if (args.full_attention_interval() > 1) {
    return (layer_idx + 1) % args.full_attention_interval() != 0;
  }
  return false;
};
The is_linear_attention_layer logic here is inconsistent with the is_full_attention_layer helper. It ignores layer_types and the default interval of 4 for zero-valued intervals. Please use the existing helper function to ensure correct KV cache allocation and consistent behavior with the engine's memory estimation.
// Helper function to check if a layer is linear attention
auto is_linear_attention_layer = [&](int64_t layer_idx) {
  return !is_full_attention_layer(args, layer_idx);
};
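Tying this back to the WorkerImpl change summarized above (conv/ssm state for linear layers, key/value caches for full-attention layers): a heavily simplified sketch of that per-layer selection, reusing the ModelArgs and is_full_attention_layer sketch from the top of this page, could look like the following. Buffer, the element counts, and allocate_caches are illustrative stand-ins, not the PR's actual types or function names.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the real cache tensor type.
struct Buffer {
  std::vector<float> data;
  explicit Buffer(int64_t n) : data(static_cast<size_t>(n)) {}
};

// Each layer stores either {key, value} caches or {conv, ssm} state.
struct LayerCache {
  Buffer first;
  Buffer second;
};

std::vector<LayerCache> allocate_caches(const ModelArgs& args,
                                        int64_t n_layers,
                                        int64_t kv_elems,
                                        int64_t conv_elems,
                                        int64_t ssm_elems) {
  std::vector<LayerCache> caches;
  caches.reserve(static_cast<size_t>(n_layers));
  for (int64_t i = 0; i < n_layers; ++i) {
    if (is_full_attention_layer(args, i)) {
      // Full-attention layer: key cache + value cache.
      caches.push_back(LayerCache{Buffer(kv_elems), Buffer(kv_elems)});
    } else {
      // Linear (GDN) layer: conv state + ssm (recurrent) state.
      caches.push_back(LayerCache{Buffer(conv_elems), Buffer(ssm_elems)});
    }
  }
  return caches;
}
```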
No description provided.