
perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1160

Open

JC-ut0 wants to merge 1 commit into jd-opensource:main from JC-ut0:gdn_cache_fix

Conversation

Contributor

@JC-ut0 JC-ut0 commented Apr 1, 2026

No description provided.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for hybrid attention models (such as qwen3_next) by differentiating between full attention and linear (GDN) attention layers during KV cache estimation and allocation. Key changes include updating LLMEngine and RecEngine to calculate cache capacity based on specific layer types, adding logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactoring WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer. Review feedback highlights the need for better consistency across the engine by utilizing the centralized is_full_attention_layer helper function to avoid logic errors related to default attention intervals and potential division-by-zero issues.

Comment on lines +522 to +525
int64_t num_full_attention_layers =
    kv_cache_cap.n_layers / args_.full_attention_interval();
int64_t num_linear_attention_layers =
    kv_cache_cap.n_layers - num_full_attention_layers;


critical

This calculation is susceptible to a division by zero error if full_attention_interval() is 0 (the default value in ModelArgs). Additionally, it does not account for configurations that use explicit layer_types. It is recommended to use the is_full_attention_layer helper function in a loop to accurately and safely count the layers, ensuring consistency with the rest of the engine.

Suggested change

-int64_t num_full_attention_layers =
-    kv_cache_cap.n_layers / args_.full_attention_interval();
-int64_t num_linear_attention_layers =
-    kv_cache_cap.n_layers - num_full_attention_layers;
+int64_t num_full_attention_layers = 0;
+int64_t num_linear_attention_layers = 0;
+for (int64_t i = 0; i < kv_cache_cap.n_layers; ++i) {
+  if (is_full_attention_layer(args_, i)) {
+    num_full_attention_layers++;
+  } else {
+    num_linear_attention_layers++;
+  }
+}
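The suggested fix leans on the `is_full_attention_layer` helper from `model_args.h`, whose body is not shown in this diff. A minimal sketch consistent with the behavior the review describes is below; the struct fields and the fallback interval of 4 are assumptions drawn from the review text, not from the actual codebase:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the ModelArgs fields referenced in the review;
// the real definition lives in model_args.h.
struct ModelArgs {
  int64_t full_attention_interval = 0;  // 0 means "not configured"
  std::vector<std::string> layer_types; // optional explicit per-layer types
};

// Sketch of the centralized helper, per the review: explicit layer_types
// take precedence; otherwise every interval-th layer is full attention,
// with an assumed default interval of 4 when the configured value is 0.
bool is_full_attention_layer(const ModelArgs& args, int64_t layer_idx) {
  if (!args.layer_types.empty()) {
    const auto& t = args.layer_types[layer_idx];
    return t == "full_attention" || t == "attention";
  }
  const int64_t interval =
      args.full_attention_interval > 0 ? args.full_attention_interval : 4;
  return (layer_idx + 1) % interval == 0;
}
```

Because every caller routes through one predicate, counting loops like the suggestion above can never divide by zero and automatically stay consistent with `layer_types`-based configurations.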

Comment on lines +180 to +199
int64_t num_full_attention_layers = kv_cache_cap.n_layers;
int64_t num_linear_attention_layers = 0;
if (args_.full_attention_interval() > 0) {
  num_full_attention_layers =
      kv_cache_cap.n_layers / args_.full_attention_interval();
  num_linear_attention_layers =
      kv_cache_cap.n_layers - num_full_attention_layers;
} else if (has_linear_attention_layers(args_)) {
  // Count layer types from args.layer_types()
  const auto& layer_types = args_.layer_types();
  num_full_attention_layers = 0;
  num_linear_attention_layers = 0;
  for (const auto& layer_type : layer_types) {
    if (layer_type == "full_attention" || layer_type == "attention") {
      num_full_attention_layers++;
    } else {
      num_linear_attention_layers++;
    }
  }
}


high

The manual counting of layers here is inconsistent with the is_full_attention_layer helper in model_args.h. Specifically, it fails to handle the default interval of 4 when full_attention_interval is 0, which the helper correctly addresses. Using the centralized helper function ensures consistency across the engine and simplifies the logic.

  int64_t num_full_attention_layers = 0;
  int64_t num_linear_attention_layers = 0;
  for (int64_t i = 0; i < kv_cache_cap.n_layers; ++i) {
    if (is_full_attention_layer(args_, i)) {
      num_full_attention_layers++;
    } else {
      num_linear_attention_layers++;
    }
  }
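To see concretely why the manual counting diverges from the helper, the two paths can be compared side by side. The sketch below is an illustration only: the layer count of 48 and the default interval of 4 are assumptions taken from the review text, and `count_full_layers` is a hypothetical function, not part of the codebase:

```cpp
#include <cstdint>
#include <utility>

// Counts full-attention layers two ways for a given configured interval,
// returning {manual_count, helper_style_count}.
std::pair<int64_t, int64_t> count_full_layers(int64_t n_layers,
                                              int64_t configured_interval) {
  // Manual path (as in the snippet above): a non-positive interval leaves
  // every layer counted as full attention.
  int64_t manual_full = n_layers;
  if (configured_interval > 0) {
    manual_full = n_layers / configured_interval;
  }

  // Helper-style path: a zero interval falls back to the default of 4,
  // and every interval-th layer is full attention.
  const int64_t interval = configured_interval > 0 ? configured_interval : 4;
  int64_t helper_full = 0;
  for (int64_t i = 0; i < n_layers; ++i) {
    if ((i + 1) % interval == 0) {
      helper_full++;
    }
  }
  return {manual_full, helper_full};
}
```

With 48 layers and an unset interval, the manual path reports 48 full-attention layers while the helper-style path reports 12, so the manual version would over-allocate full-attention KV cache for hybrid models that rely on the default interval.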

Comment on lines +240 to +245
auto is_linear_attention_layer = [&](int64_t layer_idx) {
  if (args.full_attention_interval() > 1) {
    return (layer_idx + 1) % args.full_attention_interval() != 0;
  }
  return false;
};


high

The is_linear_attention_layer logic here is inconsistent with the is_full_attention_layer helper. It ignores layer_types and the default interval of 4 for zero-valued intervals. Please use the existing helper function to ensure correct KV cache allocation and consistent behavior with the engine's memory estimation.

    // Helper function to check if a layer is linear attention
    auto is_linear_attention_layer = [&](int64_t layer_idx) {
      return !is_full_attention_layer(args, layer_idx);
    };

