perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1130

Open
yingxudeng wants to merge 1 commit into jd-opensource:main from yingxudeng:feat/qwen35-hybrid-cache-pr-prep-2

Conversation

@yingxudeng
Collaborator

No description provided.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces hybrid linear cache optimizations for NPU, specifically targeting models like Qwen3 with Gated Delta Net. It adds support for compact linear block tables, updates the KV cache allocation logic to handle hybrid layers, and integrates these changes across the distributed runtime, including serialization and worker implementations. Feedback highlights duplicated logic for attention feature conflict checks in LLMEngine and concerns regarding the robustness of the fallback mechanism for determining block size in the NPU-specific gated delta net implementation.

Comment on lines 573 to +579
bool enable_lighting_indexer = args_.index_n_heads() > 1;
bool enable_gdn_attention = has_linear_attention_layers(args_);
#if defined(USE_NPU)
CHECK(!(enable_lighting_indexer && enable_gdn_attention))
    << "KVCache does not support linear attention and lighting indexer "
       "enabled at the same time.";
#endif
Contributor


high

There is duplicated logic for checking attention feature conflicts. The same check for enable_lighting_indexer and enable_gdn_attention is also present in estimate_kv_cache_capacity (lines 527-531). This duplication can lead to maintenance issues where one instance is updated but the other is missed.

Consider refactoring this logic to avoid duplication. For example, you could calculate these boolean flags once in init_model() and store them as member variables of LLMEngine. The check could then be performed once, and both estimate_kv_cache_capacity and allocate_kv_cache can use these member variables.

Comment on lines +313 to +327
int64_t resolve_linear_state_block_size(const KVCache& kv_cache,
                                        const ModelInputParams& input_params) {
  int64_t block_size = kv_cache.block_size();
  const auto k_cache = kv_cache.get_k_cache();
  if (block_size <= 0 && k_cache.defined() && k_cache.numel() > 0 &&
      k_cache.dim() >= 2) {
    block_size = k_cache.size(1);
  }
  if (block_size <= 0 && input_params.paged_kv_last_page_len.defined() &&
      input_params.paged_kv_last_page_len.numel() > 0) {
    block_size = std::max<int64_t>(
        1, input_params.paged_kv_last_page_len.max().item<int64_t>());
  }
  return std::max<int64_t>(1, block_size);
}
Contributor


high

The fallback logic for determining block_size in resolve_linear_state_block_size is brittle and potentially incorrect.

  1. block_size = k_cache.size(1); assumes a specific tensor layout ([n_blocks, block_size, ...]) which might not hold true for all hardware backends (e.g., MLU has a different layout [n_blocks, n_head, block_size, ...]). While this code is in an NPU-specific directory, such platform-dependent logic is fragile.

  2. block_size = std::max<int64_t>(1, input_params.paged_kv_last_page_len.max().item<int64_t>()); is also unreliable. paged_kv_last_page_len represents the number of tokens in the last block of a sequence, which can be less than the actual block_size. Relying on the maximum value across a batch does not guarantee the correct block size.

Although the primary mechanism kv_cache.block_size() should work correctly with the new changes, these fallbacks introduce a risk of incorrect behavior if the primary method fails for any reason. It would be more robust to ensure block_size is always passed down correctly, for example from EngineOptions, rather than trying to infer it with these fallbacks.
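The alternative the review points at could be sketched as follows: the block size is configured once and validated up front, and the resolver fails fast instead of guessing from tensor shapes. `EngineOptions` here is an illustrative stand-in, not the project's actual options type.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: thread the block size down explicitly from engine
// configuration instead of inferring it from k_cache's tensor layout or
// from paged_kv_last_page_len (both of which can be wrong, as noted above).
struct EngineOptions {
  int64_t block_size = 0;  // configured once at engine startup
};

int64_t resolve_linear_state_block_size(const EngineOptions& options) {
  if (options.block_size <= 0) {
    // Fail fast: a silent fallback to a guessed value can corrupt the
    // linear state cache layout.
    throw std::invalid_argument(
        "linear state block_size must be configured and positive");
  }
  return options.block_size;
}
```

With this shape, a misconfiguration surfaces immediately at startup rather than as a subtle cache-layout bug at inference time.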

@yingxudeng yingxudeng changed the title perf: optimize qwen3.5 hybrid linear cache flow. perf: optimize qwen3.5 hybrid linear cache flow[4/N]. Mar 30, 2026
@yingxudeng yingxudeng marked this pull request as ready for review March 30, 2026 06:26
Comment on lines +104 to +124
inline std::vector<int32_t> build_hybrid_linear_state_block_table(
    const std::vector<int32_t>& block_ids,
    uint32_t prev_tokens,
    uint32_t total_tokens,
    int32_t block_size) {
  assert(!block_ids.empty() && "linear state block table requires blocks");
  assert(block_size > 0 && "linear state block table requires block_size > 0");
  assert(total_tokens > 0 &&
         "linear state block table requires total_tokens > 0");

  const int64_t max_block_index = static_cast<int64_t>(block_ids.size()) - 1;
  const int64_t dst_block_index = std::min<int64_t>(
      max_block_index, static_cast<int64_t>((total_tokens - 1) / block_size));
  int64_t src_block_index = dst_block_index;
  if (prev_tokens > 0) {
    src_block_index = std::min<int64_t>(
        max_block_index, static_cast<int64_t>((prev_tokens - 1) / block_size));
  }

  return {block_ids[src_block_index], block_ids[dst_block_index]};
}
Contributor


Why do we need prev_tokens and total_tokens to calculate src_block_index and dst_block_index for the linear cache? Is this method safe for multi-batch scenarios?



2 participants