perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1130

Open
yingxudeng wants to merge 1 commit into jd-opensource:main from yingxudeng:feat/qwen35-hybrid-cache-pr-prep-2

Conversation

@yingxudeng
Collaborator

No description provided.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces hybrid linear cache optimizations for NPU, specifically targeting models like Qwen3 with Gated Delta Net. It adds support for compact linear block tables, updates the KV cache allocation logic to handle hybrid layers, and integrates these changes across the distributed runtime, including serialization and worker implementations. Feedback highlights duplicated logic for attention feature conflict checks in LLMEngine and concerns regarding the robustness of the fallback mechanism for determining block size in the NPU-specific gated delta net implementation.

Comment on lines 573 to +579
bool enable_lighting_indexer = args_.index_n_heads() > 1;
bool enable_gdn_attention = has_linear_attention_layers(args_);
#if defined(USE_NPU)
CHECK(!(enable_lighting_indexer && enable_gdn_attention))
    << "KVCache does not support linear attention and lighting indexer "
       "enabled at the same time.";
#endif
Contributor


high

There is duplicated logic for checking attention feature conflicts. The same check for enable_lighting_indexer and enable_gdn_attention is also present in estimate_kv_cache_capacity (lines 527-531). This duplication can lead to maintenance issues where one instance is updated but the other is missed.

Consider refactoring this logic to avoid duplication. For example, you could calculate these boolean flags once in init_model() and store them as member variables of LLMEngine. The check could then be performed once, and both estimate_kv_cache_capacity and allocate_kv_cache can use these member variables.

Comment on lines +313 to +327
int64_t resolve_linear_state_block_size(const KVCache& kv_cache,
                                        const ModelInputParams& input_params) {
  int64_t block_size = kv_cache.block_size();
  const auto k_cache = kv_cache.get_k_cache();
  if (block_size <= 0 && k_cache.defined() && k_cache.numel() > 0 &&
      k_cache.dim() >= 2) {
    block_size = k_cache.size(1);
  }
  if (block_size <= 0 && input_params.paged_kv_last_page_len.defined() &&
      input_params.paged_kv_last_page_len.numel() > 0) {
    block_size = std::max<int64_t>(
        1, input_params.paged_kv_last_page_len.max().item<int64_t>());
  }
  return std::max<int64_t>(1, block_size);
}
Contributor


high

The fallback logic for determining block_size in resolve_linear_state_block_size is brittle and potentially incorrect.

  1. block_size = k_cache.size(1); assumes a specific tensor layout ([n_blocks, block_size, ...]) which might not hold true for all hardware backends (e.g., MLU has a different layout [n_blocks, n_head, block_size, ...]). While this code is in an NPU-specific directory, such platform-dependent logic is fragile.

  2. block_size = std::max<int64_t>(1, input_params.paged_kv_last_page_len.max().item<int64_t>()); is also unreliable. paged_kv_last_page_len represents the number of tokens in the last block of a sequence, which can be less than the actual block_size. Relying on the maximum value across a batch does not guarantee the correct block size.

Although the primary mechanism kv_cache.block_size() should work correctly with the new changes, these fallbacks introduce a risk of incorrect behavior if the primary method fails for any reason. It would be more robust to ensure block_size is always passed down correctly, for example from EngineOptions, rather than trying to infer it with these fallbacks.
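The alternative the review points at could be sketched as follows: the block size is configured once and validated up front, and the resolver fails fast instead of guessing from tensor shapes. `EngineOptions` here is an illustrative stand-in, not the project's actual options type.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: thread the block size down explicitly from engine
// configuration instead of inferring it from k_cache's tensor layout or
// from paged_kv_last_page_len (both of which can be wrong, as noted above).
struct EngineOptions {
  int64_t block_size = 0;  // configured once at engine startup
};

int64_t resolve_linear_state_block_size(const EngineOptions& options) {
  if (options.block_size <= 0) {
    // Fail fast: a silent fallback to a guessed value can corrupt the
    // linear state cache layout.
    throw std::invalid_argument(
        "linear state block_size must be configured and positive");
  }
  return options.block_size;
}
```

With this shape, a misconfiguration surfaces immediately at startup rather than as a subtle cache-layout bug at inference time.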

@yingxudeng yingxudeng changed the title perf: optimize qwen3.5 hybrid linear cache flow. perf: optimize qwen3.5 hybrid linear cache flow[4/N]. Mar 30, 2026
@yingxudeng yingxudeng marked this pull request as ready for review March 30, 2026 06:26
Comment on lines +104 to +124
inline std::vector<int32_t> build_hybrid_linear_state_block_table(
    const std::vector<int32_t>& block_ids,
    uint32_t prev_tokens,
    uint32_t total_tokens,
    int32_t block_size) {
  assert(!block_ids.empty() && "linear state block table requires blocks");
  assert(block_size > 0 && "linear state block table requires block_size > 0");
  assert(total_tokens > 0 &&
         "linear state block table requires total_tokens > 0");

  const int64_t max_block_index = static_cast<int64_t>(block_ids.size()) - 1;
  const int64_t dst_block_index = std::min<int64_t>(
      max_block_index, static_cast<int64_t>((total_tokens - 1) / block_size));
  int64_t src_block_index = dst_block_index;
  if (prev_tokens > 0) {
    src_block_index = std::min<int64_t>(
        max_block_index, static_cast<int64_t>((prev_tokens - 1) / block_size));
  }

  return {block_ids[src_block_index], block_ids[dst_block_index]};
}
Contributor


Why do we need prev_tokens and total_tokens to calculate src_block_index and dst_block_index for the linear cache? Is this method safe for multi-batch scenarios?



2 participants