
[Common] Persistent Grouped NVFP4 quantization kernel#2743

Open
Oleg-Goncharov wants to merge 49 commits into NVIDIA:main from Oleg-Goncharov:pr_persistent_grouped_nvfp4_kernel

Conversation


Oleg-Goncharov (Collaborator) commented on Mar 6, 2026

Description

This PR adds a persistent grouped NVFP4 quantization + transpose kernel with static scheduling.
It is built on top of PR #2738, [Common] Persistent Grouped MXFP8 quantization kernel.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added persistent grouped kernel
  • Added test suite

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Oleg-Goncharov and others added 30 commits February 27, 2026 15:53
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci
NVIDIA deleted a comment from greptile-apps bot Mar 9, 2026
Comment on lines +419 to +434
const size_t tensor_id, const ShapeRepresentation shape_rep, const size_t last_logical_dim,
const int64_t *const __restrict__ last_dims_ptr) {
  size_t cols_num = 0;
  switch (shape_rep) {
    case ShapeRepresentation::SAME_BOTH_DIMS:
    case ShapeRepresentation::VARYING_FIRST_DIM:
      cols_num = last_logical_dim;
      break;
    case ShapeRepresentation::VARYING_LAST_DIM:
    case ShapeRepresentation::VARYING_BOTH_DIMS:
      cols_num = static_cast<size_t>(last_dims_ptr[tensor_id]);
      if (cols_num % 128 != 0) {
        NVTE_DEVICE_ERROR(
            "For non-single tensors, the last dimension of each tensor in a group "
            "must be divisible by 128.");
      }

is_job_valid skips intra-tensor boundary check for non-SAME_BOTH_DIMS shapes

In the NVFP4 is_job_valid, once block_global_offset < tensor_end_offset is verified, the function returns true without checking whether the block's Y/X coordinates actually fall within [0, rows) and [0, cols):

const size_t tensor_end_offset = static_cast<size_t>(offsets_ptr[job.tensor_id + 1]);
if (job.block_global_offset >= tensor_end_offset) {
    return false;
}
return true;

The corresponding check in the MXFP8 version (group_quantize_mxfp8.cuh) also validates:

const size_t tensor_offset_from_start = job.block_global_offset - tensor_start_offset;
const size_t block_offset_Y_in_tensor = tensor_offset_from_start / job.cols;
const size_t block_offset_X_in_tensor = tensor_offset_from_start % job.cols;
if (block_offset_Y_in_tensor >= job.rows || block_offset_X_in_tensor >= job.cols) {
    return false;
}

For VARYING_LAST_DIM and VARYING_BOTH_DIMS shapes, omitting this check could allow stale or padding blocks (that are within tensor_end_offset but beyond the actual rows × cols footprint) to issue TMA loads from out-of-bounds addresses. Please consider adding the equivalent bounds check.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
std::vector<fp4e2m1> out_data_rowwise_h(total_elts / 2);
std::vector<fp4e2m1> out_data_colwise_h(total_elts / 2);
std::vector<fp8e4m3> out_scales_rowwise_h(rowwise_scales_num);
std::vector<fp8e4m3> out_scales_colwise_h(colwise_scales_num);

Wrong variable used in "more mismatches" condition

mismatch_messages is only appended while total_mismatches <= max_mismatches_to_print (3), so its size() can never exceed max_mismatches_to_print. The condition is therefore always false and the "... and X more mismatches" line is dead code — even inside the print_detailed_summary branch. The comparison should use total_mismatches:

Suggested change
if (total_mismatches > max_mismatches_to_print) {

Comment on lines +388 to +396
cudaMemcpy(last_dims_d, last_dims_h.data(), num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice);
cudaMemcpy(offsets_d, offsets_h.data(), (num_tensors + 1) * sizeof(int64_t), cudaMemcpyHostToDevice);

cudaMemset(out_data_rowwise_d, 0, out_data_size);
cudaMemset(out_data_colwise_d, 0, out_data_size);
cudaMemset(out_scales_rowwise_d, 0, rowwise_scales_size);
cudaMemset(out_scales_colwise_d, 0, colwise_scales_size);

NVTEShape logical_shape_ = nvte_make_shape(logical_shape.data(), logical_shape.size());

CUDA API return values are not checked

All cudaMalloc, cudaMemcpy, and cudaMemset calls in performTest silently ignore their return values. A failed allocation would leave the pointer uninitialized (or null) and the test would proceed, likely crashing or producing a spurious cudaGetLastError failure that obscures the real problem.

Consider wrapping the calls with a helper that asserts success, e.g.:

ASSERT_EQ(cudaMalloc((void**)&in_data_d, in_data_size), cudaSuccess);
ASSERT_EQ(cudaMemcpy(in_data_d, grouped_input.data(), in_data_size, cudaMemcpyHostToDevice), cudaSuccess);

This pattern applies to all CUDA API calls from the allocation block down through the cudaMemset calls (lines 388–410).

Comment on lines 163 to 164
* \param[in] stream CUDA stream used for the operation.
*/

Binary-incompatible API change without a breaking-change marker

nvte_group_quantize_dbias (and the five related nvte_group_quantize_dbias_d* functions) previously accepted NVTETensor dbias; this PR changes the parameter to NVTEGroupedTensor dbias. Any existing C/C++ caller that was compiled against the old header will silently pass the wrong type at runtime. Even though both types are opaque pointers at the ABI level, callers that stored the dbias as NVTETensor will need to migrate.

The PR description marks this as "New feature (non-breaking change)", but this signature change will break downstream callers (Python bindings, external C++ users) that previously compiled against NVTETensor dbias. It is worth auditing all internal call sites (Python pybind layer, etc.) and explicitly documenting the migration in the PR / changelog.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 97ec071 to fef9220 on March 9, 2026 at 11:55
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 2e289c9 to 9e37b4c on March 9, 2026 at 14:51
Comment on lines +104 to +115

const size_t rows = tensor_rows / chunk_dim_Y;
const size_t cols = last_logical_dim;

const size_t dbias_in_offset_Y =
(shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
? (tensor_id * (tensor_rows / chunk_dim_Y))
: (static_cast<size_t>(offsets_ptr[tensor_id]) / cols / chunk_dim_Y);

const size_t thread_id = blockIdx.x * blockDim.x + threadIdx.x;

if (thread_id * nvec >= cols) {

group_reduce_dbias_kernel uses last_logical_dim as cols for all shape representations

cols is unconditionally set to last_logical_dim:

const size_t cols = last_logical_dim;

For VARYING_LAST_DIM and VARYING_BOTH_DIMS shapes, each tensor has a different last dimension. Using the scalar last_logical_dim for all tensors will produce incorrect partial-dbias strides and wrong output write offsets (tensor_id * cols assumes uniform column counts). The same issue affects the dbias_in_offset_Y calculation for those shape representations.

cast.h documents that "Grouped dbias is not yet supported for grouped tensors with a varying last dimension," but there is no runtime guard in grouped_reduce_dbias or this kernel to enforce that. If called with such shapes the function silently corrupts memory. Consider adding an explicit NVTE_CHECK(shape_rep != ShapeRepresentation::VARYING_LAST_DIM && shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS, ...) guard in grouped_reduce_dbias before the kernel launch.

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from f6b5928 to 6a7409d on March 9, 2026 at 15:05
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 50a4921 to 6c5cc7f on March 9, 2026 at 16:05
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from c83b558 to f5e2ba0 on March 9, 2026 at 16:06
Comment on lines 175 to +176
  void nvte_group_quantize_dbias(const NVTEGroupedTensor input, NVTEGroupedTensor output,
-                                NVTETensor dbias, NVTETensor workspace, cudaStream_t stream);
+                                NVTEGroupedTensor dbias, NVTETensor workspace, cudaStream_t stream);

The six nvte_group_quantize_dbias* functions (this one and nvte_group_quantize_dbias_dgelu, nvte_group_quantize_dbias_dsilu, nvte_group_quantize_dbias_drelu) now take NVTEGroupedTensor dbias instead of NVTETensor dbias. This is a signature change that may affect existing C/C++ callers compiled against the old header. While this appears intentional as part of the grouped tensor API consolidation, consider auditing Python bindings and any external C++ code to ensure compatibility, and explicitly document the migration path in the changelog.

Comment on lines +1044 to +1055

const bool use_single_work_grid = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS ||
shape_rep == ShapeRepresentation::VARYING_FIRST_DIM);

const size_t first_logical_dim = input->logical_shape.data[0];
const size_t last_logical_dim = input->logical_shape.data[1];
const size_t elts_total = first_logical_dim * last_logical_dim;
const size_t num_tensors = input->num_tensors;

NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
"Number of tensors in a group is larger than the MAX number of supported "
"descriptors (64).");

For the SAME_BOTH_DIMS case, the kernel computes per-tensor row counts via integer division (first_logical_dim / num_tensors), which silently truncates if first_logical_dim is not exactly divisible by num_tensors. This causes incorrect base offsets and may skip or overwrite the last few rows. Add a host-side check:

if (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS) {
    NVTE_CHECK(first_logical_dim % num_tensors == 0,
               "For SAME_BOTH_DIMS, first_logical_dim (", first_logical_dim,
               ") must be divisible by num_tensors (", num_tensors, ").");
}

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov force-pushed the pr_persistent_grouped_nvfp4_kernel branch from 811a146 to eace4a6 on March 9, 2026 at 17:59