
Conversation


@zhongbozhu zhongbozhu commented Jan 3, 2026

Description

Note: #2558 reported a bug in #2411. The fix is in #2564; make sure to cherry-pick that one as well until it lands in main.

Previously, a similar optimization was applied to the MoE grouped quantize with RHT in #2411. This PR targets dense linear layers & shared experts when quantizing to NVFP4. With this fusion, the high-precision input only needs to be read once; without it, the input is read twice.

Similarly, the env var NVTE_USE_FAST_MATH controls the numerical behavior of the RHT quant fusion kernel to accelerate it further. The fast math is only applied to the high-precision math, so it should have minimal impact on training convergence.

What the fast-math toggle controls:

  1. replace x / y with x * (1/y)
  2. replace 1 / x with reciprocal_approximate_ftz(x)
  3. when RHT cast fusion is available, the fusion allows the NVFP4 quantize to be performed directly on FP32 data in register files, which essentially removes a round trip from FP32 to BF16 and back to FP32 (see the sketch below).
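
For illustration, here is a minimal CUDA sketch of the kind of substitution the toggle enables. This is not the PR's kernel code; the helper names below are illustrative, and the inline-PTX implementation of the approximate reciprocal is an assumption about how such a helper could look.

// Assumption: illustrative only, not taken from the PR's kernel.
// Approximate reciprocal with flush-to-zero via the rcp.approx.ftz.f32 PTX instruction.
__device__ __forceinline__ float reciprocal_approximate_ftz(float x) {
  float y;
  asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(y) : "f"(x));
  return y;
}

// Hypothetical helper: divide a value by a scale, with the fast-math variant replacing
// the IEEE division by a multiply with an approximate reciprocal.
__device__ __forceinline__ float scale_by(float v, float s, bool use_fast_math) {
  if (use_fast_math) {
    return v * reciprocal_approximate_ftz(s);  // x / y -> x * (1 / y)
  }
  return v / s;  // exact IEEE-rounded division
}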

Therefore, I DO recommend turning it on, since it significantly improves the RHT kernel performance.

The only reason it's still not on by default is that we want ZERO-TOLERANCE tests between our CUDA quantize kernels and our PyTorch-based emulated quantize references. With the fast-math toggle turned on, it's hard to pass the tests with zero tolerance without further investigation into how to relax the test conditions while still providing high confidence in the test cases.

TODO items:

  • Merge the bug-fix PR #2564 first.
  • Some deprecated Cutlass APIs are being used, which produces many warnings.
  • Maybe turn on fast math by default and use NVTE_DISABLE_RHT_FAST_MATH instead of NVTE_USE_FAST_MATH? @timmoon10 for opinions.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zhongbozhu zhongbozhu requested a review from timmoon10 January 3, 2026 01:23
@zhongbozhu zhongbozhu self-assigned this Jan 3, 2026
@zhongbozhu zhongbozhu added the fp4 label Jan 3, 2026

greptile-apps bot commented Jan 3, 2026

Greptile Overview

Greptile Summary

Integrates a Cutlass-based CUDA fusion kernel for NVFP4 quantization that combines rowwise casting and columnwise RHT-transpose-cast operations into a single kernel for dense linear layers and shared experts. The optimization reduces memory bandwidth by reading high-precision input once instead of twice, with eligibility detection based on BF16 input and 64×128 dimension alignment. The NVTE_USE_FAST_MATH environment variable controls fast-math optimizations for RHT kernels.

Confidence Score: 2/5

  • Multiple potential issues found in fusion logic and RNG state management that could cause runtime errors or incorrect results
  • The PR introduces complex low-level CUDA code with several concerning issues: (1) potential race condition in quantizer.cpp where separate RNG states are conditionally created but may not align with actual kernel paths, (2) inconsistent RNG state management between fused and unfused paths could lead to different random behavior, (3) the fusion eligibility check (rows%64==0 && cols%128==0) doesn't match the documented kernel requirements or align with the alignment checks in cast.cpp (128-aligned), (4) missing validation that input dimensions meet kernel constraints before calling the fusion kernel. While comprehensive tests exist, these logic issues warrant careful review before merging.
  • transformer_engine/pytorch/csrc/quantizer.cpp for RNG state logic and fusion eligibility checks, transformer_engine/pytorch/csrc/extensions/cast.cpp for alignment validation consistency

Important Files Changed

File Analysis

Filename Score Overview
transformer_engine/common/hadamard_transform/row_cast_col_hadamard_transform_cast_fusion.cu 3/5 Adds complex Cutlass-based CUDA fusion kernel for NVFP4 row-cast + column-RHT-transpose-cast; kernel uses advanced SM100 features (TMEM, TMA, cluster launch) with pipeline parallelism across MMA/DMA/epilogue warps
transformer_engine/pytorch/csrc/quantizer.cpp 2/5 Refactors NVFP4Quantizer::quantize_impl to detect fusion eligibility (BF16 + 64×128 alignment) and call new fused API; fallback path uses separate RNG states for rowwise/columnwise when unfused; adds helper quantize_with_rht_unfused_helper
transformer_engine/pytorch/csrc/extensions/cast.cpp 3/5 Updates split_quantize_nvfp4_impl_with_rht_helper to conditionally use fully-fused grouped kernel when all split_sections are 128-aligned; separate rowwise/columnwise RNG states for unfused path

Sequence Diagram

sequenceDiagram
    participant Input as BF16 Input Tensor
    participant Quantizer as NVFP4Quantizer
    participant Fusion as Fused Kernel
    participant Unfused as Unfused Path
    participant Output as NVFP4 Output

    Input->>Quantizer: "quantize_impl()"
    Quantizer->>Quantizer: "Check: BF16 & rows%64==0 & cols%128==0"
    
    alt Fusion Eligible
        Quantizer->>Fusion: "nvte_hadamard_transform_cast_fusion()"
        Fusion->>Fusion: "Single kernel: Rowwise + RHT + Colwise"
        Fusion->>Output: "Both quantizations complete"
    else Unfused Path
        alt Rowwise enabled
            Quantizer->>Unfused: "nvte_quantize_v2(rowwise)"
            Unfused->>Output: "Rowwise quantization"
        end
        alt Columnwise enabled
            Quantizer->>Unfused: "nvte_hadamard_transform()"
            Unfused->>Unfused: "Apply RHT"
            Unfused->>Unfused: "nvte_quantize_v2(columnwise)"
            Unfused->>Output: "Columnwise quantization"
        end
    end

@zhongbozhu zhongbozhu changed the title from "[NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose Fusion Kernel" to "[NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel" on Jan 3, 2026
@zhongbozhu zhongbozhu force-pushed the zhongbo/dense_row_col_rht_fp4_quant branch from c80932f to fc42825 on January 3, 2026 04:16
@zhongbozhu

/te-ci arm L1


@greptile-apps greptile-apps bot left a comment


Additional Comments (2)

  1. benchmarks/linear/benchmark_linear.py, line 141 (link)

    logic: NVTX range is pushed but never popped in the benchmark function

  2. transformer_engine/common/hadamard_transform/row_cast_col_hadamard_transform_cast_fusion.cu, line 346 (link)

    syntax: Typo in comment: 'SMEMork' should be 'SMEM work'

8 files reviewed, 2 comments


@zhongbozhu zhongbozhu added the MoE label Jan 6, 2026
@zhongbozhu zhongbozhu changed the title from "[NVFP4][Dense] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel" to "[NVFP4][Dense/MoE] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel, Fixing NVFP4 Group Quant Bug" on Jan 6, 2026
@zhongbozhu zhongbozhu changed the title from "[NVFP4][Dense/MoE] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel, Fixing NVFP4 Group Quant Bug" to "[NVFP4][Dense/MoE] Integrate Cutlass NVFP4 Row-Cast-Col-RHT-Transpose-Cast Fusion Kernel" on Jan 6, 2026
@zhongbozhu zhongbozhu force-pushed the zhongbo/dense_row_col_rht_fp4_quant branch from 2bc695e to 6ea9dab on January 9, 2026 23:14

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR integrates a Cutlass-based fusion kernel that combines row-wise quantization and column-wise RHT (Random Hadamard Transform) + quantization + transpose operations for NVFP4 dense linear layers and shared experts. The key optimization reduces memory bandwidth by reading high-precision input data once instead of twice.

Key Changes

New Fusion Kernel (row_cast_col_hadamard_transform_cast_fusion.cu):

  • Implements nvte_hadamard_transform_cast_fusion API that performs both rowwise and columnwise quantization in a single pass
  • Uses MMA hardware for efficient Hadamard transform computation
  • Eligible when the input is BF16 with rows divisible by 64 and columns divisible by 128
  • Reads pre-computed amax values to calculate FP8 scaling factors
  • Supports stochastic rounding and fast math optimization flags

Refactored Quantizer Logic (quantizer.cpp):

  • Moved unfused RHT path into quantize_with_rht_unfused_helper method for cleaner code organization
  • Improved RNG state handling: single RNG state when fusion is used, separate states for rowwise/columnwise when unfused
  • Added NVTE_USE_FAST_MATH environment variable support for accelerating high-precision math operations
  • Eligibility check moved before RNG state generation to avoid unnecessary work

Extended Test Coverage (test_nvfp4_rht_quantize_exact.py):

  • Added "columnwise-only" quantization mode testing alongside existing "quantize" and "quantize_transpose" modes
  • Tests now validate rowwise/columnwise results conditionally based on the quantization mode

Grouped Quantization Support (cast.cpp):

  • Split-quantize path now uses fused kernel when all tensors have 128-aligned dimensions
  • Bulk RNG state generation for grouped kernels (single state shared across splits)
  • Fast math flag propagation to all quantization configs
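
A minimal sketch of the 128-alignment gate described above (the function name here is illustrative; per the file analysis, the real check lives in split_quantize_nvfp4_impl_with_rht_helper in cast.cpp):

#include <cstddef>
#include <vector>

// Illustrative only: the fully-fused grouped kernel is taken only when every
// split section is 128-aligned; otherwise the per-split unfused path is used.
bool all_splits_128_aligned(const std::vector<size_t> &split_sections) {
  for (const size_t section : split_sections) {
    if (section % 128 != 0) return false;
  }
  return true;
}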

Architecture Notes

The fusion provides optimal performance when:

  1. Input dtype is BF16
  2. Rows are divisible by 64 (MMA tile requirement)
  3. Columns are divisible by 128 (MMA tile requirement)

When these conditions aren't met, the code gracefully falls back to the unfused path with separate kernel launches for rowwise and columnwise quantization.
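
For orientation, here is a condensed host-side sketch of that dispatch, pieced together from the code hunks quoted in the review comments below; the RNG/amax setup and the arguments to the unfused helper are elided, so this is not a drop-in excerpt.

// Condensed sketch of the quantize_impl dispatch (setup elided).
const bool eligible_for_rht_cast_fusion =
    input.dtype() == DType::kBFloat16 && rows % 64 == 0 && cols % 128 == 0;

if (eligible_for_rht_cast_fusion) {
  // Single fused kernel: rowwise quantization + RHT + columnwise quantization/transpose.
  auto rht_matrix_nvte = makeTransformerEngineTensor(this->rht_matrix);
  nvte_hadamard_transform_cast_fusion(input.data(), out.data(), rht_matrix_nvte.data(),
                                      quant_config, stream);
} else {
  // Fallback: separate kernel launches for rowwise cast and RHT + columnwise cast.
  quantize_with_rht_unfused_helper(/* ...elided arguments... */);
}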

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk after addressing documentation and TODO items mentioned in the PR description
  • Score of 4 reflects a well-engineered feature with thorough implementation. The code demonstrates good software practices: clean refactoring with extracted helper methods, proper error handling, graceful fallback paths, and comprehensive test coverage including the new columnwise-only mode. The fusion kernel follows established patterns from the grouped quantization PR #2411. Deducted 1 point due to: (1) PR author notes cutlass deprecation warnings need addressing, (2) TODOs remain about potentially defaulting fast math on, and (3) the ~1400 line CUDA kernel file has limited inline documentation for complex template logic
  • The main CUDA kernel file (row_cast_col_hadamard_transform_cast_fusion.cu) would benefit from additional inline comments explaining the template parameter switches and MMA computation flow, but no files have critical issues requiring immediate attention

Important Files Changed

File Analysis

Filename Score Overview
transformer_engine/pytorch/csrc/quantizer.cpp 4/5 Refactored NVFP4 quantize_impl to use new fused RHT cast kernel, extracted unfused helper, improved RNG state handling for fused vs unfused paths
transformer_engine/common/hadamard_transform/row_cast_col_hadamard_transform_cast_fusion.cu 4/5 New CUDA kernel implementing fused row-cast and column-RHT-transpose-cast using Cutlass MMA hardware for BF16 inputs with 64x128 alignment
transformer_engine/common/include/transformer_engine/hadamard_transform.h 5/5 Added new API function nvte_hadamard_transform_cast_fusion for dense layer quantization, marked old columnwise function for future deprecation
transformer_engine/pytorch/csrc/extensions/cast.cpp 4/5 Added NVTE_USE_FAST_MATH env var support in split_quantize for grouped NVFP4 kernels, improved RNG state setup with bulk generation flag
tests/pytorch/nvfp4/test_nvfp4_rht_quantize_exact.py 5/5 Extended test coverage to support columnwise-only quantization mode, added return_identity parameter to test all three modes

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Quantizer as NVFP4Quantizer
    participant API as nvte_hadamard_transform_cast_fusion
    participant Kernel as row_col_rht_gemm_ntt_w_sfc
    participant AmaxKernel as nvte_hadamard_transform_amax
    
    User->>Quantizer: quantize(input, output)
    Quantizer->>Quantizer: Check eligibility (BF16, rows%64==0, cols%128==0)
    
    alt With RHT and eligible for fusion
        Quantizer->>AmaxKernel: Compute rowwise & columnwise amax
        AmaxKernel-->>Quantizer: amax values populated
        
        alt Stochastic rounding enabled
            Quantizer->>Quantizer: Generate RNG state
        end
        
        alt Fast math enabled (NVTE_USE_FAST_MATH)
            Quantizer->>Quantizer: Set use_fast_math flag
        end
        
        Quantizer->>API: Call with input, output, hadamard_matrix, quant_config
        API->>Kernel: Launch fused kernel
        
        Kernel->>Kernel: Read amax values
        Kernel->>Kernel: Perform rowwise quantization to FP4
        Kernel->>Kernel: Compute RHT using MMA hardware
        Kernel->>Kernel: Transpose and quantize to FP4
        Kernel->>Kernel: Write FP8 scales
        
        Kernel-->>API: Complete
        API-->>Quantizer: Return
        
    else Not eligible for fusion
        Quantizer->>AmaxKernel: Compute amax
        AmaxKernel-->>Quantizer: amax values
        
        alt Rowwise usage
            Quantizer->>Quantizer: Call nvte_quantize_v2 for rowwise
        end
        
        alt Columnwise usage
            Quantizer->>Quantizer: Call nvte_hadamard_transform for RHT
            Quantizer->>Quantizer: Call nvte_quantize_v2 for columnwise
        end
    end
    
    Quantizer-->>User: Quantized output


@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment



@greptile-apps greptile-apps bot left a comment


7 files reviewed, 7 comments


Comment on lines +1551 to +1552
bool eligible_for_rht_cast_fusion =
input.dtype() == DType::kBFloat16 && rows % 64 == 0 && cols % 128 == 0;


Fusion eligibility check uses rows % 64 == 0 && cols % 128 == 0 but this doesn't match the alignment check in cast.cpp line 815 which uses split_section % 128 == 0. The mismatched alignment requirements could cause the fusion kernel to be called with incompatible dimensions. Should this be rows % 128 == 0 && cols % 128 == 0 to match the grouped kernel requirements?

Comment on lines +1566 to +1568
const bool need_separate_columnwise_rng = this->stochastic_rounding && this->with_rht &&
this->columnwise_usage &&
(!eligible_for_rht_cast_fusion);


The logic for need_separate_columnwise_rng checks !eligible_for_rht_cast_fusion, but the actual kernel path selection happens later at line 1673. If the fusion path is taken, both rowwise and columnwise quantization use the same RNG state (from quant_config), but if unfused, they use separate states. This could cause non-deterministic behavior when conditions change slightly (e.g., input shape changes by a few elements). Consider generating separate RNG states unconditionally when both rowwise and columnwise are enabled, then have the fusion kernel use both appropriately.

Comment on lines +1673 to +1682
if (eligible_for_rht_cast_fusion) {
// fusion kernel requires passing in RHT matrix directly for maximum performance
auto rht_matrix_nvte = makeTransformerEngineTensor(this->rht_matrix);
// Fusion kernel that does the following:
// 1. Rowwise quantization
// 2. RHT followed by columnwise quantization & transpose
NVTE_SCOPED_GIL_RELEASE({
nvte_hadamard_transform_cast_fusion(input.data(), out.data(), rht_matrix_nvte.data(),
quant_config, stream);
});


Missing validation that input dimensions meet fusion kernel requirements before calling nvte_hadamard_transform_cast_fusion. The kernel expects dimensions aligned to specific tile sizes (128×128 based on the kernel code), but there's no check that rows and cols satisfy these constraints. If dimensions are misaligned, the kernel could produce incorrect results or crash.

Comment on lines +1176 to +1178
// Assert checks problem size should be multiple of 64
assert(M % 64 == 0);
assert(N % 64 == 0);


The assertions assert(M % 64 == 0) and assert(N % 64 == 0) in the kernel launch function will cause program termination in release builds where asserts are disabled. For production code handling user inputs, these should be replaced with proper error checking using NVTE_CHECK that works in all build configurations.

Comment on lines +363 to +364

// Allocate SMEMork


Typo in comment: "SMEMork" should be "SMEM work" or "SMEM fork"


greptile-apps bot commented Jan 10, 2026

Additional Comments (2)

transformer_engine/pytorch/csrc/extensions/cast.cpp
The alignment check split_section % 128 == 0 for fusion eligibility should be documented or extracted as a constant. This magic number appears in multiple places (quantizer.cpp line 1552 uses different values) and the inconsistency suggests potential bugs. Consider defining constexpr size_t NVFP4_FUSION_ROW_ALIGNMENT = 128; to centralize this requirement.


transformer_engine/pytorch/csrc/quantizer.cpp
When need_separate_columnwise_rng is true, a new RNG state is generated for columnwise quantization. However, the philox generator state is advanced for each call, which means the random sequences will differ between runs if the fusion eligibility changes (e.g., due to shape variations). For reproducibility across different code paths, consider using a deterministic offset based on tensor properties rather than sequential generator advancement.
