Variable Grouped Swizzle #2914
Conversation
… update C++ operator interface Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Abhishek <[email protected]>
for more information, see https://pre-commit.ci
Greptile Summary

This PR adds a variable-shape code path to the grouped MXFP8 swizzle operation, introducing a new persistent-grid CUDA kernel (grouped_swizzle_scaling_variable_shape_kernel).
Confidence Score: 2/5

Not safe to merge: three P1 issues mean the new variable-shape kernel is never invoked, tests do not validate it, and compact-layout inputs will corrupt output data. There are multiple independent P1 bugs: the PyTorch extension never reaches the new kernel, the kernel itself lacks compact-layout input support, and the test suite does not exercise the new code path. The feature is effectively a no-op from the PyTorch side and partially untested at the C-API level.
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["maybe_swizzle_grouped_tensor (swizzle.cpp)"] --> B["Build swizzle_input\n(num_tensors, logical_shape, scaling_mode)"]
    B --> C{"tensor_offsets set?"}
    C -- Yes --> D["set_tensor_offsets on swizzle_input"]
    C -- No --> E["❌ first_dims / last_dims\nNEVER forwarded"]
    D --> E
    E --> F["nvte_swizzle_grouped_scaling_factors\n(swizzle.cu)"]
    F --> G{"input->all_same_shape()\n= !first_dims.has_data()\n&& !last_dims.has_data()"}
    G -- "true (always, even for variable shapes)" --> H["Uniform path\nget_common_first_dim() = logical_shape[0] / num_tensors\n⚠️ wrong for variable shapes"]
    G -- "false (never reached)" --> I["Variable-shape kernel\ngrouped_swizzle_scaling_variable_shape_kernel"]
    H --> J["Wrong scale factors written to output"]
    I --> K["Correct persistent-grid dispatch\n(correct but unreachable)"]
```
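For reference, the dispatch the flowchart describes reduces to a single shape-uniformity check. The following is a minimal sketch of that routing under the stated assumption; the function and helper names (GroupedTensor, launch_uniform_grouped_swizzle, launch_variable_shape_grouped_swizzle) are illustrative, not the PR's exact code:

```cpp
// Sketch of the routing inside nvte_swizzle_grouped_scaling_factors: the variable-shape
// kernel is only reachable when the per-tensor dimension arrays are attached to the
// grouped input, which is exactly what the PyTorch extension currently never does.
void swizzle_grouped_scaling_factors(const GroupedTensor *input, GroupedTensor *output,
                                     void *workspace, cudaStream_t stream) {
  // all_same_shape() is true iff neither first_dims nor last_dims carry data.
  const bool is_variable_shape = !input->all_same_shape();
  if (!is_variable_shape) {
    launch_uniform_grouped_swizzle(input, output, stream);                    // existing path
  } else {
    launch_variable_shape_grouped_swizzle(input, output, workspace, stream);  // new kernel
  }
}
```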
```cpp
int device_id;
cudaGetDevice(&device_id);
int num_SMs;
cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id);
// Find out how many blocks of this specific kernel can fit on one SM
int max_active_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &max_active_blocks_per_sm,
    grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>,
    TB_DIM * TB_DIM,  // block size
    max_slm_size      // dynamic shared memory
);
int persistent_blocks = num_SMs * max_active_blocks_per_sm;
```
Unchecked CUDA API calls can silently produce zero-block launches
cudaGetDevice, cudaDeviceGetAttribute, and cudaOccupancyMaxActiveBlocksPerMultiprocessor are all called without NVTE_CHECK_CUDA. If any of these fail, max_active_blocks_per_sm is left with an indeterminate (or zero) value, making persistent_blocks = 0. Launching the persistent kernel with 0 blocks is legal in CUDA — it silently does nothing — so the output buffer stays uninitialized with no error raised.
Suggested change:

```diff
 int device_id;
-cudaGetDevice(&device_id);
+NVTE_CHECK_CUDA(cudaGetDevice(&device_id));
 int num_SMs;
-cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id);
+NVTE_CHECK_CUDA(cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id));
 // Find out how many blocks of this specific kernel can fit on one SM
 int max_active_blocks_per_sm;
-cudaOccupancyMaxActiveBlocksPerMultiprocessor(
+NVTE_CHECK_CUDA(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
     &max_active_blocks_per_sm,
     grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>,
     TB_DIM * TB_DIM,  // block size
     max_slm_size      // dynamic shared memory
-);
+));
+NVTE_CHECK(max_active_blocks_per_sm > 0, "Occupancy query returned 0 blocks per SM.");
 int persistent_blocks = num_SMs * max_active_blocks_per_sm;
```
```cpp
if (!is_variable_shape) {
  // Fallback to uniform shape implementation
  NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes.");
  NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(),
             "Grouped swizzle requires uniform tensor shapes.");
```
Dead code: redundant assertions inside the `!is_variable_shape` branch
is_variable_shape is defined as !input->all_same_shape(), so inside if (!is_variable_shape) the two NVTE_CHECK calls are tautologies — they can never fire. They add noise and could mislead future readers into thinking the branch can handle non-uniform shapes. Consider removing them or converting them to a comment.
Suggested change:

```diff
 if (!is_variable_shape) {
-  // Fallback to uniform shape implementation
-  NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes.");
-  NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(),
-             "Grouped swizzle requires uniform tensor shapes.");
+  // All tensors share the same shape; use the optimised uniform-shape path.
```
```cpp
if (int_stride % 2 != 0) int_stride++;
int* d_block_offsets = reinterpret_cast<int*>(workspace);
int* d_global_counter = d_block_offsets + num_tensors + 1;
int* d_total_blocks = d_global_counter + 1;
```
d_total_blocks is written but never consumed
d_total_blocks is populated by compute_grouped_swizzle_setup (as *total_blocks = current_block_offset) but is never read by the persistent kernel or any host code afterward. The persistent grid terminates via the tensor_id == -1 sentinel, not via a stored total. If this field was intended as a diagnostic or future guard, a comment would clarify its purpose; otherwise it can be removed to avoid confusing future maintainers and wasting a device-side write.
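For context, a sentinel-terminated persistent grid of the kind described here could look roughly like the sketch below. This is illustrative only; the struct, array, and counter names are assumptions rather than the PR's actual kernel, and the explicit bound check is extra safety added in this sketch:

```cpp
// Sketch of a persistent-grid loop that stops on a tensor_id == -1 sentinel instead of
// reading a stored total block count.
struct SwizzleWorkItem {
  int tensor_id;   // -1 marks the sentinel entry written by the setup step
  int tile_row;    // tile coordinates within that tensor's scale-factor matrix
  int tile_col;
};

__global__ void persistent_swizzle_sketch(const SwizzleWorkItem *work_items, int num_items,
                                          int *global_counter) {
  __shared__ int work_idx;
  while (true) {
    if (threadIdx.x == 0 && threadIdx.y == 0) {
      // Each block fetches its next work item; the grid size is fixed (persistent),
      // so no block idles while work remains.
      work_idx = atomicAdd(global_counter, 1);
    }
    __syncthreads();
    const int idx = work_idx;
    if (idx >= num_items || work_items[idx].tensor_id < 0) break;  // sentinel reached
    // ... all threads in the block cooperatively swizzle the tile described by
    //     work_items[idx] ...
    __syncthreads();  // ensure every thread is done before work_idx is overwritten
  }
}
```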
```cpp
size_t num_tensors = input.num_tensors();
size_t num_int_elems = num_tensors + 3;  // n+1 block_offsets + gc + tb
if (num_int_elems % 2 != 0) num_int_elems++;  // pad to even for size_t alignment
size_t workspace_size = num_int_elems * sizeof(int) + (num_tensors + 1) * sizeof(size_t);
workspace_size = roundup(workspace_size, 256);
auto workspace =
    allocateSpace(std::vector<size_t>{workspace_size}, transformer_engine::DType::kByte, false);
```
```diff
 NVTE_SCOPED_GIL_RELEASE({
   nvte_swizzle_grouped_scaling_factors(swizzle_input.data(), swizzle_output.data(),
-                                       at::cuda::getCurrentCUDAStream());
+                                       getDataPtr(workspace), at::cuda::getCurrentCUDAStream());
```
Workspace allocated unconditionally even for uniform-shape inputs
The workspace is only consumed by the variable-shape code path in swizzle_grouped_scaling_factors. For uniform shapes the pointer is accepted but immediately ignored. Gating the allocation on whether variable shapes are present (e.g., first_dims.data_ptr != nullptr || last_dims.data_ptr != nullptr) would avoid a small but unnecessary device allocation on every invocation with uniform tensors. This is a performance suggestion, not a correctness issue.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
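One way to realize the suggestion is to allocate the workspace lazily. The sketch below uses plain `at::empty` instead of the extension's `allocateSpace` helper to stay self-contained; the helper name, its signature, and the use of optional first/last dim tensors as the gating condition are assumptions, not the PR's actual code:

```cpp
#include <optional>
#include <torch/torch.h>

// Sketch only: return a null workspace pointer when all tensors share the same shape,
// since the uniform-shape path never reads the workspace. The caller keeps
// `workspace_out` alive and forwards the returned pointer to
// nvte_swizzle_grouped_scaling_factors as in the snippet above.
void *maybe_alloc_swizzle_workspace(size_t workspace_size,
                                    const std::optional<at::Tensor> &first_dims,
                                    const std::optional<at::Tensor> &last_dims,
                                    at::Tensor &workspace_out) {
  if (!first_dims.has_value() && !last_dims.has_value()) {
    return nullptr;  // uniform shapes: skip the device allocation entirely
  }
  workspace_out = at::empty({static_cast<int64_t>(workspace_size)},
                            at::device(at::kCUDA).dtype(at::kByte));
  return workspace_out.data_ptr();
}
```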
```cpp
    cudaFuncAttributeMaxDynamicSharedMemorySize, max_slm_size));

int device_id;
cudaGetDevice(&device_id);
```
Caching attributes like the number of SMs and the max active blocks per device would be ideal to reduce CPU overheads on each call.
We already have a function in transformer_engine/common/util/cuda_runtime.cpp called "sm_count". Could you please use that here?
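A hedged sketch of what that could look like in the launcher. The header path and the exact namespace/signature of `sm_count`, the assumption that `max_slm_size` is fixed for a given kernel instantiation, and the single-device caching are mine, not verified against the PR; the other names come from the snippets above:

```cpp
#include "common/util/cuda_runtime.h"  // assumed header for transformer_engine::cuda::sm_count

// Reuse TE's cached SM-count helper instead of calling cudaDeviceGetAttribute per
// invocation, and query occupancy once per kernel instantiation with a local static
// (assumes a single device architecture per process and a fixed max_slm_size).
const int num_SMs = transformer_engine::cuda::sm_count();

static const int max_active_blocks_per_sm = [&] {
  int blocks = 0;
  NVTE_CHECK_CUDA(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks, grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>,
      TB_DIM * TB_DIM, max_slm_size));
  NVTE_CHECK(blocks > 0, "Occupancy query returned 0 blocks per SM.");
  return blocks;
}();
const int persistent_blocks = num_SMs * max_active_blocks_per_sm;
```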
```cpp
// Fallback to uniform shape implementation
NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes.");
NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(),
           "Grouped swizzle requires uniform tensor shapes.");
```
These checks might not be needed, given we already used input->all_same_shape() to reach this stage.
vthumbe1503
left a comment
I think that workspace allocation + a small kernel for computing offsets + a persistent kernel might be overkill for swizzling. @int-smart Do you have some performance numbers by any chance for the swizzling kernel on Blackwell?
How about we follow an SM-filling grid pattern like the one in the grouped_bias_add kernel in this PR?
https://github.com/NVIDIA/TransformerEngine/pull/2885/changes/BASE..b64559af9b89d816b8d7ffba4f5273e556d90c8e#diff-fa75cbeb11caf588f79b811be355c8f00b0cf5d4b807c259b94f2a40ffc8db6f
With this pattern the thread block id is decided dynamically based on sum(first_dims), and at the same time we divide the rows of the grouped tensor uniformly among the SMs. However, it only handles variable first_dims (the idea would need to be extended to other cases, such as all dims being variable).
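For illustration, the SM-filling idea described above could look roughly like the sketch below. It is not the grouped_bias_add kernel itself; all names (sm_filling_swizzle_sketch, cum_first_dims) are illustrative:

```cpp
// Sketch: a fixed-size grid splits the total row count evenly, and each row is mapped
// back to (tensor_id, local_row) by searching the cumulative first_dims array.
__global__ void sm_filling_swizzle_sketch(const int64_t *cum_first_dims,  // size num_tensors + 1
                                          int num_tensors, int64_t total_rows) {
  // Even slice of [0, total_rows) for this block; the grid size would be derived from the SM count.
  const int64_t rows_per_block = (total_rows + gridDim.x - 1) / gridDim.x;
  const int64_t row_begin = static_cast<int64_t>(blockIdx.x) * rows_per_block;
  int64_t row_end = row_begin + rows_per_block;
  if (row_end > total_rows) row_end = total_rows;

  for (int64_t row = row_begin; row < row_end; ++row) {
    // Binary search for the tensor whose cumulative row range contains `row`.
    int lo = 0, hi = num_tensors;
    while (lo + 1 < hi) {
      const int mid = (lo + hi) / 2;
      if (cum_first_dims[mid] <= row) lo = mid; else hi = mid;
    }
    const int tensor_id = lo;
    const int64_t local_row = row - cum_first_dims[tensor_id];
    // ... all threads of the block cooperatively swizzle the scale factors for
    //     row `local_row` of tensor `tensor_id` ...
    (void)local_row;
  }
}
```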
@vthumbe1503 Will check the PR and get back.
With regards to Blackwell, I don't have the numbers tbh. I can generate them for the RTX 40 series.
Signed-off-by: Abhishek <[email protected]>
for more information, see https://pre-commit.ci
@vthumbe1503 Consolidated to one kernel, removed the shared memory allocation, and tried to stick to the PR you mentioned. If this works, let me know. It seems to perform better on an RTX 4070 than my last approach. There are still some optimizations that can be done, but those would need more shared memory allocation.
```cpp
int num_SMs;
cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id);
int max_active_blocks_per_sm;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
```
Can we add NVTE_CHECK_CUDA around these API calls?
Thanks for the PR @int-smart. Overall LGTM. I have a few minor comments w.r.t. CPU overheads and error handling.
/te-ci |
Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Abhishek <[email protected]>
Signed-off-by: Abhishek <[email protected]>
for more information, see https://pre-commit.ci
```cpp
const size_t padded_scale_elems = padded_m * padded_k;
const size_t compact_scale_elems =
    rowwise ? m * padded_k : DIVUP(k, static_cast<size_t>(MXFP8_BLOCK_SIZE)) * padded_m;
const size_t compact_total_scale_elems =
    rowwise ? round_up_to_multiple(input->num_tensors * m, 128) * padded_k
            : round_up_to_multiple(
                  input->num_tensors * DIVUP(k, static_cast<size_t>(MXFP8_BLOCK_SIZE)), 4) *
                  padded_m;
```
Variable-shape kernel assumes padded-layout input; no compact-layout support
In the uniform path (lines 2213–2240), the code detects whether the input is in "compact" or "padded" layout and adjusts input_stride_bytes accordingly. The variable-shape kernel (grouped_swizzle_scaling_variable_shape_kernel) has no equivalent detection — current_scale_base accumulates padded_m * padded_k * scale_elem_size unconditionally, which is the padded stride. If the variable-shape input arrives in the compact layout (which the quantize kernel emits), reads for all tensors after the first will be offset to the wrong position in memory, producing corrupted scale factors with no error raised.
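A hedged sketch of how the variable-shape setup could accumulate layout-aware input offsets, mirroring the compact and padded element counts in the snippet above. The per-tensor padding formulas, the `is_compact_layout` flag, and the array names are assumptions, not the PR's code:

```cpp
// Sketch: advance the per-tensor input base by the tensor's actual scale-factor
// footprint, which differs between the padded layout and the compact layout the
// quantize kernel emits.
size_t scale_base_bytes = 0;
for (size_t i = 0; i < num_tensors; ++i) {
  const size_t m = first_dims[i];
  const size_t k = last_dims[i];
  const size_t k_blocks = DIVUP(k, static_cast<size_t>(MXFP8_BLOCK_SIZE));
  const size_t padded_m = roundup(m, static_cast<size_t>(SF_TILE_DIM_M));
  const size_t padded_k = roundup(k_blocks, static_cast<size_t>(SF_TILE_DIM_K));

  per_tensor_input_offset_bytes[i] = scale_base_bytes;

  const size_t scale_elems = is_compact_layout
                                 ? (rowwise ? m * padded_k : k_blocks * padded_m)
                                 : padded_m * padded_k;
  scale_base_bytes += scale_elems * scale_elem_size;
}
```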
```diff
 }

 if (!same_first || !same_last) {
-  grouped.offsets_dev = cuda_alloc<int64_t>(num_tensors * sizeof(int64_t));
+  size_t num_off = num_tensors + 1;
+  grouped.offsets_dev = cuda_alloc<int64_t>(num_off * sizeof(int64_t));
   NVTE_CHECK_CUDA(cudaMemcpy(grouped.offsets_dev.get(), offsets.data(),
-                             num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice));
-  NVTEShape off_shape = nvte_make_shape(&num_tensors, 1);
+                             num_off * sizeof(int64_t), cudaMemcpyHostToDevice));
+  NVTEShape off_shape = nvte_make_shape(&num_off, 1);
   NVTEBasicTensor off_tensor{grouped.offsets_dev.get(), kNVTEInt64, off_shape};
   nvte_set_grouped_tensor_param(h, kNVTEGroupedTensorOffsets, &off_tensor, sizeof(off_tensor));
```
build_grouped_tensor sets offsets but not first_dims/last_dims — variable-shape test exercises uniform kernel
nvte_set_grouped_tensor_param is called for kNVTEGroupedTensorOffsets but never for kNVTEGroupedFirstDims or kNVTEGroupedLastDims. As a result, input->all_same_shape() returns true inside swizzle_grouped_scaling_factors even for the variable-shape test cases, so the new grouped_swizzle_scaling_variable_shape_kernel is never actually exercised. The test validates the uniform kernel with a wider variety of shapes rather than the new variable-shape code path it claims to cover.
kNVTEGroupedFirstDims and kNVTEGroupedLastDims need to be populated (analogous to how offsets are populated) for the variable-shape branch to be reached.
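A hedged sketch of what that could look like in the test helper, mirroring the offsets handling above. The `grouped.*_dev` field names, the int64 dtype, and the cuda_alloc usage are assumptions based on the surrounding snippet:

```cpp
if (!same_first || !same_last) {
  // Populate per-tensor first/last dims so input->all_same_shape() becomes false and
  // the variable-shape kernel is actually exercised.
  grouped.first_dims_dev = cuda_alloc<int64_t>(num_tensors * sizeof(int64_t));
  grouped.last_dims_dev = cuda_alloc<int64_t>(num_tensors * sizeof(int64_t));
  NVTE_CHECK_CUDA(cudaMemcpy(grouped.first_dims_dev.get(), first_dims.data(),
                             num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice));
  NVTE_CHECK_CUDA(cudaMemcpy(grouped.last_dims_dev.get(), last_dims.data(),
                             num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice));
  NVTEShape dims_shape = nvte_make_shape(&num_tensors, 1);
  NVTEBasicTensor first_dims_tensor{grouped.first_dims_dev.get(), kNVTEInt64, dims_shape};
  NVTEBasicTensor last_dims_tensor{grouped.last_dims_dev.get(), kNVTEInt64, dims_shape};
  nvte_set_grouped_tensor_param(h, kNVTEGroupedFirstDims, &first_dims_tensor,
                                sizeof(first_dims_tensor));
  nvte_set_grouped_tensor_param(h, kNVTEGroupedLastDims, &last_dims_tensor,
                                sizeof(last_dims_tensor));
}
```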
Description
Grouped swizzle with variable shapes. Not sure if this is needed; if not, this can be closed.
Fixes #2451
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: