feat: support FIA for qwen model on npu device.#1147
sanlio36 wants to merge 12 commits into jd-opensource:main
Conversation
Code Review
This pull request introduces FIA (Fused Interlayer Attention) support for NPU-based Qwen2, Qwen3, and Qwen3-MoE decoder layers by adding a global configuration flag and implementing the required mask and index tensor logic. Review feedback highlights critical bugs where pointers are dereferenced without initialization, potentially causing runtime crashes. Additionally, the implementation is currently limited by a hardcoded sequence length of 2048 for FIA masks, and the Qwen2 variant pack incorrectly uses placeholders instead of the intended index tensors.
```cpp
    ModelInputParams& input_params,
    bool is_prefill) {
  if (is_prefill) {
    *prefill_param_.bs = std::max(1, input_params.num_sequences);
```
```cpp
    bool is_prefill,
    int node_id) {
  if (is_prefill) {
    *prefill_param_.bs = std::max(1, input_params.num_sequences);
```
```cpp
    const ModelInputParams& input_params,
    bool is_prefill) {
  if (is_prefill) {
    *prefill_param_.bs = std::max(1, input_params.num_sequences);
```
```cpp
namespace layer {

namespace {
constexpr int64_t kFiaMaskSeqLen = 2048;
```
Hardcoding kFiaMaskSeqLen to 2048 limits the maximum sequence length supported when FIA is enabled. If the input sequence length exceeds 2048, the fia_attn_mask_ (initialized at line 229) will have incorrect dimensions for the attention operation, potentially leading to incorrect results or crashes. This should be made dynamic or set to a sufficiently large value supported by the model's context window.
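One way to make the mask length dynamic, as suggested above, is to round the current maximum sequence length up instead of fixing it at compile time. This is a hypothetical sketch (the helper `fia_mask_seq_len` is not in the PR): it keeps 2048 as a floor and grows geometrically, so the mask tensor only needs reallocating when the context actually outgrows it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical replacement for the hardcoded kFiaMaskSeqLen: round the
// required length up to the next power-of-two multiple of 2048, so the mask
// is reallocated rarely and never undersized for the current batch.
int64_t fia_mask_seq_len(int64_t max_seq_len) {
  int64_t len = 2048;  // keep the existing default as a floor
  while (len < max_seq_len) {
    len *= 2;  // grow geometrically to amortize reallocations
  }
  return len;
}
```

With this scheme, a batch whose longest sequence is 3000 tokens would get a 4096-length mask rather than a silently undersized 2048 one.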
```cpp
if (is_prefill && FLAGS_enable_fia) {
  node.variantPack.inTensors.at(input_offset++) =  // bsnd padding_idx
      placeholder_;
  node.variantPack.inTensors.at(input_offset++) =  // bsnd unpadding_idx
      placeholder_;
}
```
The FIA index tensors are not being correctly passed to the variant pack; placeholder_ is used instead of the actual index tensors fia_padding_idx_ and fia_unpadding_idx_. Additionally, the call to build_fia_index_tensors is commented out at line 381. This implementation is incomplete and will cause the FIA feature to fail. Note that calling build_fia_index_tensors inside build_node_variant_pack may have performance implications due to repeated device allocations and H2D transfers; consider caching these tensors if the sequence lengths haven't changed.
```cpp
if (is_prefill && FLAGS_enable_fia) {
  build_fia_index_tensors(input_params, x.size(0));
  node.variantPack.inTensors.at(input_offset++) =
      atb_speed::Utils::AtTensor2Tensor(fia_padding_idx_);
  node.variantPack.inTensors.at(input_offset++) =
      atb_speed::Utils::AtTensor2Tensor(fia_unpadding_idx_);
}
```
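The caching idea raised in the comment above can be sketched independently of the tensor types. This is a hypothetical illustration (the `FiaIndexCache` struct is not in the PR, and a plain `std::vector` stands in for the device tensor): the rebuild, which in the real code would mean a fresh allocation plus an H2D copy, runs only when the per-sequence lengths differ from the previous call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch of caching the FIA index tensors across
// build_node_variant_pack calls: skip the rebuild when the per-sequence
// lengths are unchanged since the last invocation.
struct FiaIndexCache {
  std::vector<int64_t> cached_seq_lens;
  std::vector<int64_t> padding_idx;  // stand-in for the device-side tensor
  int rebuild_count = 0;

  // Returns true if a rebuild (allocation + H2D transfer in real code) ran.
  bool update(const std::vector<int64_t>& seq_lens) {
    if (seq_lens == cached_seq_lens) {
      return false;  // cache hit: reuse the existing index tensors
    }
    cached_seq_lens = seq_lens;
    padding_idx.assign(seq_lens.begin(), seq_lens.end());  // placeholder build
    ++rebuild_count;
    return true;
  }
};
```

For decode-heavy workloads where batch composition changes every step this buys little, but for repeated prefill calls with identical shapes it avoids redundant device traffic.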
No description provided.