perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1160
JC-ut0 wants to merge 1 commit into jd-opensource:main from
Conversation
Code Review
This pull request introduces support for hybrid attention models (such as qwen3_next) by differentiating between full attention and linear (GDN) attention layers during KV cache estimation and allocation. Key changes include updating LLMEngine and RecEngine to calculate cache capacity based on specific layer types, adding logic to AclGraph to correctly identify valid KV caches in mixed-layer models, and refactoring WorkerImpl to selectively allocate specific cache tensors (conv/ssm vs. key/value) per layer. Review feedback highlights the need for better consistency across the engine by utilizing the centralized is_full_attention_layer helper function to avoid logic errors related to default attention intervals and potential division-by-zero issues.
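The reviews below repeatedly refer to a centralized is_full_attention_layer helper in model_args.h that is not quoted in this thread. For reference only, a minimal sketch consistent with the behavior the reviewers describe (explicit layer_types take precedence; a zero full_attention_interval falls back to a default of 4) might look like the following. The field names and the precedence order are assumptions, and plain struct fields stand in for the project's accessor methods.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch only: stand-in for the real ModelArgs, which exposes these values
// through accessor methods rather than public fields.
struct ModelArgs {
  std::vector<std::string> layer_types;  // per-layer types, may be empty
  int64_t full_attention_interval = 0;   // 0 is treated as "use default of 4"
};

inline bool is_full_attention_layer(const ModelArgs& args, int64_t layer_idx) {
  // Explicit per-layer types win when the config provides them.
  if (!args.layer_types.empty()) {
    const auto& type = args.layer_types[static_cast<size_t>(layer_idx)];
    return type == "full_attention" || type == "attention";
  }
  // Interval rule otherwise; a zero interval falls back to 4, so the modulo
  // below never divides by zero. Every interval-th layer is full attention.
  const int64_t interval =
      args.full_attention_interval > 0 ? args.full_attention_interval : 4;
  return (layer_idx + 1) % interval == 0;
}
```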
int64_t num_full_attention_layers =
    kv_cache_cap.n_layers / args_.full_attention_interval();
int64_t num_linear_attention_layers =
    kv_cache_cap.n_layers - num_full_attention_layers;
This calculation is susceptible to a division by zero error if full_attention_interval() is 0 (the default value in ModelArgs). Additionally, it does not account for configurations that use explicit layer_types. It is recommended to use the is_full_attention_layer helper function in a loop to accurately and safely count the layers, ensuring consistency with the rest of the engine.
-int64_t num_full_attention_layers =
-    kv_cache_cap.n_layers / args_.full_attention_interval();
-int64_t num_linear_attention_layers =
-    kv_cache_cap.n_layers - num_full_attention_layers;
+int64_t num_full_attention_layers = 0;
+int64_t num_linear_attention_layers = 0;
+for (int64_t i = 0; i < kv_cache_cap.n_layers; ++i) {
+  if (is_full_attention_layer(args_, i)) {
+    num_full_attention_layers++;
+  } else {
+    num_linear_attention_layers++;
+  }
+}
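For context on why the split matters: full-attention layers and linear (GDN) layers have very different cache footprints, so the two counts enter the capacity estimate separately. The following is a rough, hypothetical illustration of that idea, not the PR's actual formula; every parameter name here is an assumption.

```cpp
#include <cstdint>

// Hypothetical illustration only: full-attention layers hold per-token
// key/value entries, while linear (GDN) layers hold fixed-size conv/ssm
// state per sequence, so each layer count is weighted by a different cost.
int64_t estimate_cache_bytes(int64_t num_full_attention_layers,
                             int64_t num_linear_attention_layers,
                             int64_t kv_bytes_per_token_per_layer,
                             int64_t state_bytes_per_seq_per_layer,
                             int64_t max_cached_tokens,
                             int64_t max_sequences) {
  const int64_t full_bytes = num_full_attention_layers *
                             kv_bytes_per_token_per_layer * max_cached_tokens;
  const int64_t linear_bytes = num_linear_attention_layers *
                               state_bytes_per_seq_per_layer * max_sequences;
  return full_bytes + linear_bytes;
}
```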
int64_t num_full_attention_layers = kv_cache_cap.n_layers;
int64_t num_linear_attention_layers = 0;
if (args_.full_attention_interval() > 0) {
  num_full_attention_layers =
      kv_cache_cap.n_layers / args_.full_attention_interval();
  num_linear_attention_layers =
      kv_cache_cap.n_layers - num_full_attention_layers;
} else if (has_linear_attention_layers(args_)) {
  // Count layer types from args.layer_types()
  const auto& layer_types = args_.layer_types();
  num_full_attention_layers = 0;
  num_linear_attention_layers = 0;
  for (const auto& layer_type : layer_types) {
    if (layer_type == "full_attention" || layer_type == "attention") {
      num_full_attention_layers++;
    } else {
      num_linear_attention_layers++;
    }
  }
}
The manual counting of layers here is inconsistent with the is_full_attention_layer helper in model_args.h. Specifically, it fails to handle the default interval of 4 when full_attention_interval is 0, which the helper correctly addresses. Using the centralized helper function ensures consistency across the engine and simplifies the logic.
int64_t num_full_attention_layers = 0;
int64_t num_linear_attention_layers = 0;
for (int64_t i = 0; i < kv_cache_cap.n_layers; ++i) {
  if (is_full_attention_layer(args_, i)) {
    num_full_attention_layers++;
  } else {
    num_linear_attention_layers++;
  }
}

auto is_linear_attention_layer = [&](int64_t layer_idx) {
  if (args.full_attention_interval() > 1) {
    return (layer_idx + 1) % args.full_attention_interval() != 0;
  }
  return false;
};
The is_linear_attention_layer logic here is inconsistent with the is_full_attention_layer helper. It ignores layer_types and the default interval of 4 for zero-valued intervals. Please use the existing helper function to ensure correct KV cache allocation and consistent behavior with the engine's memory estimation.
// Helper function to check if a layer is linear attention
auto is_linear_attention_layer = [&](int64_t layer_idx) {
  return !is_full_attention_layer(args, layer_idx);
};
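Tying this back to the WorkerImpl change summarized above (conv/ssm state for linear layers, key/value caches for full-attention layers): a heavily simplified sketch of that per-layer selection, reusing the ModelArgs and is_full_attention_layer sketch from the top of this page, could look like the following. Buffer, the element counts, and allocate_caches are illustrative stand-ins, not the PR's actual types or function names.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the real cache tensor type.
struct Buffer {
  std::vector<float> data;
  explicit Buffer(int64_t n) : data(static_cast<size_t>(n)) {}
};

// Each layer stores either {key, value} caches or {conv, ssm} state.
struct LayerCache {
  Buffer first;
  Buffer second;
};

std::vector<LayerCache> allocate_caches(const ModelArgs& args,
                                        int64_t n_layers,
                                        int64_t kv_elems,
                                        int64_t conv_elems,
                                        int64_t ssm_elems) {
  std::vector<LayerCache> caches;
  caches.reserve(static_cast<size_t>(n_layers));
  for (int64_t i = 0; i < n_layers; ++i) {
    if (is_full_attention_layer(args, i)) {
      // Full-attention layer: key cache + value cache.
      caches.push_back(LayerCache{Buffer(kv_elems), Buffer(kv_elems)});
    } else {
      // Linear (GDN) layer: conv state + ssm (recurrent) state.
      caches.push_back(LayerCache{Buffer(conv_elems), Buffer(ssm_elems)});
    }
  }
  return caches;
}
```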
No description provided.