
feat: support FIA for qwen model on npu device. #1147

Open
sanlio36 wants to merge 12 commits into jd-opensource:main from sanlio36:dev_qwen_fia

Conversation

@sanlio36
Collaborator

No description provided.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces FIA (Fused Interlayer Attention) support for NPU-based Qwen2, Qwen3, and Qwen3-MoE decoder layers by adding a global configuration flag and implementing the required mask and index tensor logic. Review feedback highlights critical bugs where pointers are dereferenced without initialization, potentially causing runtime crashes. Additionally, the implementation is currently limited by a hardcoded sequence length of 2048 for FIA masks, and the Qwen2 variant pack incorrectly uses placeholders instead of the intended index tensors.

ModelInputParams& input_params,
bool is_prefill) {
if (is_prefill) {
*prefill_param_.bs = std::max(1, input_params.num_sequences);
critical

The pointer `prefill_param_.bs` is dereferenced here without being initialized to point to valid memory. This will cause a null pointer dereference or memory corruption at runtime. Ensure that `bs` is properly allocated (e.g., in the constructor) before this assignment.
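A minimal, self-contained sketch of the suggested fix. The types and names below are illustrative stand-ins for the repository's real `prefill_param_` structure, not its actual API; the point is only that the pointer must be backed by owned storage before the assignment runs:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the real parameter struct: the NPU op consumes
// `bs` through a host pointer, so something must own the pointee.
struct FiaPrefillParam {
  int32_t* bs = nullptr;
};

class DecoderLayerFixture {
 public:
  DecoderLayerFixture() {
    // Point `bs` at member storage in the constructor, before any
    // update call can dereference it.
    prefill_param_.bs = &bs_storage_;
  }

  void update_param(int num_sequences, bool is_prefill) {
    if (is_prefill) {
      // Safe now: `bs` points at `bs_storage_`, not at null.
      *prefill_param_.bs = std::max(1, num_sequences);
    }
  }

  int32_t batch_size() const { return *prefill_param_.bs; }

 private:
  int32_t bs_storage_ = 1;
  FiaPrefillParam prefill_param_;
};
```

Wiring the pointer in the constructor keeps every later `*prefill_param_.bs = ...` write valid for the object's whole lifetime.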

bool is_prefill,
int node_id) {
if (is_prefill) {
*prefill_param_.bs = std::max(1, input_params.num_sequences);
critical

The pointer `prefill_param_.bs` is dereferenced without initialization. This is a critical issue that will lead to a crash at runtime.

const ModelInputParams& input_params,
bool is_prefill) {
if (is_prefill) {
*prefill_param_.bs = std::max(1, input_params.num_sequences);
critical

The pointer `prefill_param_.bs` is dereferenced without initialization. This will cause a null pointer dereference.

namespace layer {

namespace {
constexpr int64_t kFiaMaskSeqLen = 2048;
high

Hardcoding `kFiaMaskSeqLen` to 2048 limits the maximum sequence length supported when FIA is enabled. If the input sequence length exceeds 2048, the `fia_attn_mask_` (initialized at line 229) will have incorrect dimensions for the attention operation, potentially leading to incorrect results or crashes. This should be made dynamic or set to a sufficiently large value supported by the model's context window.
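One way to make the limit dynamic, sketched with illustrative names (the rounding helper and the tile constant below are assumptions, not the repository's API): size the mask from the actual maximum prefill sequence length, rounded up to a tile multiple so the mask tensor only needs re-allocation when sequences grow past the current capacity.

```cpp
#include <algorithm>
#include <cstdint>

// Re-allocation granularity for the FIA mask; keeping the old 2048 value
// as the tile means short workloads behave exactly as before.
constexpr int64_t kFiaMaskTile = 2048;

// Round the requested mask length up to the next tile multiple
// (minimum one tile), instead of hardcoding a single fixed length.
inline int64_t fia_mask_seq_len(int64_t max_seq_len) {
  const int64_t len = std::max<int64_t>(max_seq_len, 1);
  return ((len + kFiaMaskTile - 1) / kFiaMaskTile) * kFiaMaskTile;
}
```

The caller would compare the result against the currently allocated `fia_attn_mask_` length and rebuild the mask only when the rounded value increases.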

Comment on lines +409 to +414
  if (is_prefill && FLAGS_enable_fia) {
    node.variantPack.inTensors.at(input_offset++) =  // bsnd padding_idx
        placeholder_;
    node.variantPack.inTensors.at(input_offset++) =  // bsnd unpadding_idx
        placeholder_;
  }
high

The FIA index tensors are not being correctly passed to the variant pack; `placeholder_` is used instead of the actual index tensors `fia_padding_idx_` and `fia_unpadding_idx_`. Additionally, the call to `build_fia_index_tensors` is commented out at line 381. This implementation is incomplete and will cause the FIA feature to fail. Note that calling `build_fia_index_tensors` inside `build_node_variant_pack` may have performance implications due to repeated device allocations and H2D transfers; consider caching these tensors if the sequence lengths haven't changed.

  if (is_prefill && FLAGS_enable_fia) {
    build_fia_index_tensors(input_params, x.size(0));
    node.variantPack.inTensors.at(input_offset++) = 
        atb_speed::Utils::AtTensor2Tensor(fia_padding_idx_);
    node.variantPack.inTensors.at(input_offset++) = 
        atb_speed::Utils::AtTensor2Tensor(fia_unpadding_idx_);
  }
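The caching idea from the review comment could look like the following sketch. `FiaIndexCache` is a hypothetical helper, and plain `std::vector`s stand in for the real device tensors and the H2D upload; only the hit/miss logic is the point:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Rebuild the FIA padding/unpadding index tensors only when the
// per-request sequence lengths actually change, instead of on every
// build_node_variant_pack call.
class FiaIndexCache {
 public:
  // Returns true when the indices were rebuilt (cache miss).
  bool update(const std::vector<int>& seq_lens) {
    if (seq_lens == cached_seq_lens_) return false;  // cache hit: reuse
    cached_seq_lens_ = seq_lens;
    padding_idx_.clear();
    unpadding_idx_.clear();
    // Illustrative index construction: map between the padded
    // [bs, max_len] layout and the unpadded token stream.
    int max_len = 0;
    for (int len : seq_lens) max_len = std::max(max_len, len);
    int token = 0;
    for (std::size_t b = 0; b < seq_lens.size(); ++b) {
      for (int t = 0; t < seq_lens[b]; ++t) {
        padding_idx_.push_back(static_cast<int>(b) * max_len + t);
        unpadding_idx_.push_back(token++);
      }
    }
    return true;  // caller performs the device allocation / H2D copy here
  }

  const std::vector<int>& padding_idx() const { return padding_idx_; }
  const std::vector<int>& unpadding_idx() const { return unpadding_idx_; }

 private:
  std::vector<int> cached_seq_lens_;
  std::vector<int> padding_idx_;
  std::vector<int> unpadding_idx_;
};
```

With this shape, repeated prefill steps over identical sequence lengths skip both the index recomputation and the H2D transfer.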

namespace layer {

namespace {
constexpr int64_t kFiaMaskSeqLen = 2048;
high

Hardcoding `kFiaMaskSeqLen` to 2048 will cause failures for sequences longer than this limit when FIA is enabled, as the attention mask will have incorrect dimensions.

namespace layer {

namespace {
constexpr int64_t kFiaMaskSeqLen = 2048;
high

The hardcoded `kFiaMaskSeqLen` of 2048 limits FIA to short sequences and will cause issues if the input exceeds this length.
