perf: optimize qwen3.5 hybrid linear cache flow [4/N] #1130

yingxudeng wants to merge 1 commit into jd-opensource:main
Conversation
Code Review
This pull request introduces hybrid linear cache optimizations for NPU, specifically targeting models like Qwen3 with Gated Delta Net. It adds support for compact linear block tables, updates the KV cache allocation logic to handle hybrid layers, and integrates these changes across the distributed runtime, including serialization and worker implementations. Feedback highlights duplicated logic for attention feature conflict checks in LLMEngine and concerns regarding the robustness of the fallback mechanism for determining block size in the NPU-specific gated delta net implementation.
```cpp
bool enable_lighting_indexer = args_.index_n_heads() > 1;
bool enable_gdn_attention = has_linear_attention_layers(args_);
#if defined(USE_NPU)
CHECK(!(enable_lighting_indexer && enable_gdn_attention))
    << "KVCache does not support linear attention and lighting indexer "
       "enabled at the same time.";
#endif
```
There is duplicated logic for checking attention feature conflicts: the same check on `enable_lighting_indexer` and `enable_gdn_attention` also appears in `estimate_kv_cache_capacity` (lines 527-531). This duplication invites maintenance issues where one instance is updated but the other is missed.
Consider refactoring to avoid the duplication. For example, you could compute these boolean flags once in `init_model()` and store them as member variables of `LLMEngine`. The check would then run once, and both `estimate_kv_cache_capacity` and `allocate_kv_cache` could read the member variables.
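A minimal sketch of that refactor (the `ModelArgs` stub, member names, and accessors here are hypothetical stand-ins for the real engine types, and `assert` stands in for `CHECK`):

```cpp
#include <cassert>

// Hypothetical stand-ins for the real ModelArgs / helper in the engine
// codebase; only the shape of the refactor matters here.
struct ModelArgs {
  int index_n_heads = 0;
  bool has_linear_attention = false;
};
inline bool has_linear_attention_layers(const ModelArgs& args) {
  return args.has_linear_attention;
}

// Compute the feature flags once in init_model(); both
// estimate_kv_cache_capacity() and allocate_kv_cache() would then read the
// cached members instead of re-deriving (and re-checking) them.
class LLMEngine {
 public:
  explicit LLMEngine(ModelArgs args) : args_(args) {}

  void init_model() {
    enable_lighting_indexer_ = args_.index_n_heads > 1;
    enable_gdn_attention_ = has_linear_attention_layers(args_);
    // The conflict check now lives in exactly one place.
    assert(!(enable_lighting_indexer_ && enable_gdn_attention_) &&
           "KVCache does not support linear attention and lighting indexer "
           "enabled at the same time.");
  }

  bool enable_lighting_indexer() const { return enable_lighting_indexer_; }
  bool enable_gdn_attention() const { return enable_gdn_attention_; }

 private:
  ModelArgs args_;
  bool enable_lighting_indexer_ = false;
  bool enable_gdn_attention_ = false;
};
```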
```cpp
int64_t resolve_linear_state_block_size(const KVCache& kv_cache,
                                        const ModelInputParams& input_params) {
  int64_t block_size = kv_cache.block_size();
  const auto k_cache = kv_cache.get_k_cache();
  if (block_size <= 0 && k_cache.defined() && k_cache.numel() > 0 &&
      k_cache.dim() >= 2) {
    block_size = k_cache.size(1);
  }
  if (block_size <= 0 && input_params.paged_kv_last_page_len.defined() &&
      input_params.paged_kv_last_page_len.numel() > 0) {
    block_size = std::max<int64_t>(
        1, input_params.paged_kv_last_page_len.max().item<int64_t>());
  }
  return std::max<int64_t>(1, block_size);
}
```
The fallback logic for determining `block_size` in `resolve_linear_state_block_size` is brittle and potentially incorrect.

- `block_size = k_cache.size(1);` assumes a specific tensor layout (`[n_blocks, block_size, ...]`) which might not hold for all hardware backends (e.g., MLU uses a different layout, `[n_blocks, n_head, block_size, ...]`). While this code lives in an NPU-specific directory, such platform-dependent logic is fragile.
- `block_size = std::max<int64_t>(1, input_params.paged_kv_last_page_len.max().item<int64_t>());` is also unreliable. `paged_kv_last_page_len` holds the number of tokens in the last block of each sequence, which can be less than the actual `block_size`, so taking the maximum across a batch does not guarantee recovering the true block size.

Although the primary mechanism, `kv_cache.block_size()`, should work correctly with the new changes, these fallbacks risk incorrect behavior if the primary method ever fails. It would be more robust to always pass `block_size` down explicitly, for example from `EngineOptions`, rather than inferring it with these fallbacks.
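To make the second point concrete (the values below are made up): every entry of `paged_kv_last_page_len` is at most the real block size, so the batch maximum systematically undershoots it whenever no sequence happens to fill its last block.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mimics the fallback: infer block_size as the batch maximum of the
// per-sequence last-page token counts.
int64_t infer_block_size_from_last_page_len(
    const std::vector<int64_t>& last_page_len) {
  int64_t inferred = 1;
  for (int64_t len : last_page_len) inferred = std::max(inferred, len);
  return inferred;
}
```

With a true block size of 128 and last-page lengths `{3, 17, 64}`, this fallback reports 64.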
```cpp
inline std::vector<int32_t> build_hybrid_linear_state_block_table(
    const std::vector<int32_t>& block_ids,
    uint32_t prev_tokens,
    uint32_t total_tokens,
    int32_t block_size) {
  assert(!block_ids.empty() && "linear state block table requires blocks");
  assert(block_size > 0 && "linear state block table requires block_size > 0");
  assert(total_tokens > 0 &&
         "linear state block table requires total_tokens > 0");

  const int64_t max_block_index = static_cast<int64_t>(block_ids.size()) - 1;
  const int64_t dst_block_index = std::min<int64_t>(
      max_block_index, static_cast<int64_t>((total_tokens - 1) / block_size));
  int64_t src_block_index = dst_block_index;
  if (prev_tokens > 0) {
    src_block_index = std::min<int64_t>(
        max_block_index, static_cast<int64_t>((prev_tokens - 1) / block_size));
  }

  return {block_ids[src_block_index], block_ids[dst_block_index]};
}
```
Why do we need `prev_tokens` and `total_tokens` to calculate `src_block_index` and `dst_block_index` for the linear cache? Is this method safe for multi-batch scenarios?
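For reference, the index math depends only on a single sequence's `prev_tokens` / `total_tokens`, so multi-batch safety appears to hinge on the caller invoking the helper once per sequence with that sequence's own block list. A standalone reconstruction of the selection logic (a sketch, not the PR's code) makes the behavior concrete: `src` is the block holding the last previously-cached token, `dst` the block holding the last token overall.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Reconstruction of build_hybrid_linear_state_block_table's index math.
// All inputs describe one sequence; returns {src_index, dst_index} into
// that sequence's block_ids.
std::pair<int64_t, int64_t> linear_state_block_indices(int64_t n_blocks,
                                                       uint32_t prev_tokens,
                                                       uint32_t total_tokens,
                                                       int32_t block_size) {
  const int64_t max_block_index = n_blocks - 1;
  const int64_t dst = std::min<int64_t>(
      max_block_index, static_cast<int64_t>((total_tokens - 1) / block_size));
  int64_t src = dst;
  if (prev_tokens > 0) {
    src = std::min<int64_t>(
        max_block_index, static_cast<int64_t>((prev_tokens - 1) / block_size));
  }
  return {src, dst};
}
```

For example, with `block_size = 16` and 4 blocks, a sequence with 15 cached tokens decoding to 17 total reads the linear state from block index 0 and writes it to block index 1.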