feat: add DeepSeek-V4 NPU Sparse Attention (SAS) and Lightning Indexer (LI) patches#221
0hujun wants to merge 16 commits
Code Review
This pull request introduces NPU acceleration patches for DeepSeek-V4, adding Sparse Attention Shared-KV (SAS) and Lightning Indexer (LI) monkey-patches using mindspeed operators, along with a training script and documentation. Feedback highlights critical issues in the training script, including an undefined save_checkpoint function and an incorrectly configured DataLoader that lacks device_mesh for distributed training. Additionally, robustness improvements are recommended in the kernel patches, such as handling potential None values for sparse indices and catching broader exceptions to ensure reliable fallbacks to standard PyTorch implementations.
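The two robustness points in the review (treating missing sparse indices as a valid input, and falling back on more than just `ImportError`) can be illustrated with a small, framework-free sketch. The names `with_fallback` and `select_keys` are hypothetical and are not taken from the PR; plain Python lists stand in for tensors.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def with_fallback(fast_fn, reference_fn):
    """Return a callable that tries fast_fn and falls back to reference_fn.

    Hypothetical helper: catches any Exception (not only ImportError), logs
    it, and reruns the call on the reference implementation.
    """
    @functools.wraps(reference_fn)
    def wrapper(*args, **kwargs):
        try:
            return fast_fn(*args, **kwargs)
        except Exception as exc:  # deliberately broad, per the review feedback
            logger.warning("accelerated path failed (%s); using fallback", exc)
            return reference_fn(*args, **kwargs)

    return wrapper


def select_keys(keys, sparse_indices=None):
    """Gather entries of keys by index; None means 'keep everything'.

    Hypothetical stand-in for code that consumes top-k indices.
    """
    if sparse_indices is None:
        return list(keys)
    return [keys[i] for i in sparse_indices]
```

A broad `except` keeps training running when the fused kernel fails at call time, at the cost of possibly hiding real bugs, so logging the exception matters.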
…r (LI) patches
Update src/twinkle/kernel/deepseek_v4_npu.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
feat: add DeepSeek-V4 NPU Sparse Attention (SAS) and Lightning Indexer (LI) patches
Add monkey-patch support for DeepSeek-V4 NPU accelerated attention and indexer
kernels via mindspeed, without modifying the transformers source code.
Changes
New
- src/twinkle/kernel/deepseek_v4_npu.py: Core patch implementation
  - _patched_attention_forward: Replaces DeepseekV4Attention.forward with the mindspeed.ops.npu_sparse_attn_shared_kv.SparseAttnSharedKV fused kernel. Supports all three layer types: sliding_attention, CSA, and HCA.
  - _patched_indexer_forward: Replaces DeepseekV4Indexer.forward with mindspeed.ops.npu_lightning_indexer for NPU-accelerated top-k selection.
  - ImportError fallback to original implementations.
Modified
- src/twinkle/kernel/monkey_patch_npu.py: Registration and control
  - _apply_deepseek_v4_npu_patch() called from apply_npu_patch().
  - TWINKLE_NPU_DSV4_SAS and TWINKLE_NPU_DSV4_LI.
  - ValueError if both SAS and LI are enabled simultaneously.
  - config.architectures.
New
- cookbook/transformers/deepseek_v4_patch/README.md: Documentation with dependency list, env var reference, and usage examples.
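As a rough illustration of the monkey-patch pattern described in the change list (swap a class's `forward` when an optional backend imports, otherwise keep the original), here is a minimal sketch. `ToyAttention`, `apply_patch`, and `make_forward` are hypothetical names used only for this example; the real patch targets DeepseekV4Attention and the mindspeed ops, whose call signatures are not shown in this PR description and are not reproduced here.

```python
import importlib


class ToyAttention:
    """Hypothetical stand-in for a model class such as DeepseekV4Attention."""

    def forward(self, x):
        return [v * 2 for v in x]


def apply_patch(cls, module_name, make_forward):
    """Replace cls.forward if module_name can be imported.

    make_forward(accel_module, original_forward) must return the new method.
    Returns True when the patch was installed, False when the import failed
    and the original implementation was left in place.
    """
    try:
        accel = importlib.import_module(module_name)
    except ImportError:
        return False  # fallback: keep the original forward untouched

    original = cls.forward
    cls._original_forward = original  # kept so the patch can be undone
    cls.forward = make_forward(accel, original)
    return True
```

Because the swap happens on the class object at runtime, the library that defines the class does not need to be edited, which matches the stated goal of leaving the transformers source unmodified.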
Environment Variables
- TWINKLE_NPU_DSV4_SAS: 0
- TWINKLE_NPU_DSV4_LI: 0

SAS and LI cannot be enabled at the same time.
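The mutual-exclusion rule could be enforced along the following lines. This is a sketch under assumptions: the helper names are hypothetical, and treating the string "1" as "enabled" with "0" as the default is a guess at the parsing rule, not something the PR description states.

```python
import os


def _flag(env, name):
    """Assumption: the value '1' enables a switch; anything else disables it."""
    return env.get(name, "0").strip() == "1"


def read_dsv4_flags(env=None):
    """Return (sas_enabled, li_enabled), rejecting the unsupported combination.

    env defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    sas = _flag(env, "TWINKLE_NPU_DSV4_SAS")
    li = _flag(env, "TWINKLE_NPU_DSV4_LI")
    if sas and li:
        raise ValueError(
            "TWINKLE_NPU_DSV4_SAS and TWINKLE_NPU_DSV4_LI cannot both be enabled"
        )
    return sas, li
```

Failing fast at patch-registration time, rather than inside a forward pass, makes the misconfiguration visible before any training step runs.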
Dependencies
- mindspeed: Provides SparseAttnSharedKV and npu_lightning_indexer NPU ops
- torch_npu: Ascend NPU runtime
- transformers: Must include DeepSeek-V4 model support
Testing
Verified on Ascend A3 with DeepSeek-V4-Flash-BF16, 4 layers,
8-card EP, gradient checkpointing enabled:
Time cost:
Usage
See cookbook/transformers/deepseek_v4_patch/README.md.
In cooperation with @meichangsu1.