[Common] Persistent Grouped NVFP4 quantization kernel #2743
Oleg-Goncharov wants to merge 49 commits into NVIDIA:main
Conversation
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci
transformer_engine/common/cast/nvfp4/specialized/group_quantize_transpose_nvfp4_tuned_1D.cuh
```cpp
    const size_t tensor_id, const ShapeRepresentation shape_rep, const size_t last_logical_dim,
    const int64_t *const __restrict__ last_dims_ptr) {
  size_t cols_num = 0;
  switch (shape_rep) {
    case ShapeRepresentation::SAME_BOTH_DIMS:
    case ShapeRepresentation::VARYING_FIRST_DIM:
      cols_num = last_logical_dim;
      break;
    case ShapeRepresentation::VARYING_LAST_DIM:
    case ShapeRepresentation::VARYING_BOTH_DIMS:
      cols_num = static_cast<size_t>(last_dims_ptr[tensor_id]);
      if (cols_num % 128 != 0) {
        NVTE_DEVICE_ERROR(
            "For non-single tensors, the last dimension of each tensor in a group "
            "must be divisible by 128.");
      }
```
is_job_valid skips intra-tensor boundary check for non-SAME_BOTH_DIMS shapes

In the NVFP4 is_job_valid, once block_global_offset < tensor_end_offset is verified, the function returns true without checking whether the block's Y/X coordinates actually fall within [0, rows) and [0, cols):

```cpp
const size_t tensor_end_offset = static_cast<size_t>(offsets_ptr[job.tensor_id + 1]);
if (job.block_global_offset >= tensor_end_offset) {
  return false;
}
return true;
```

The corresponding check in the MXFP8 version (group_quantize_mxfp8.cuh) also validates:

```cpp
const size_t tensor_offset_from_start = job.block_global_offset - tensor_start_offset;
const size_t block_offset_Y_in_tensor = tensor_offset_from_start / job.cols;
const size_t block_offset_X_in_tensor = tensor_offset_from_start % job.cols;
if (block_offset_Y_in_tensor >= job.rows || block_offset_X_in_tensor >= job.cols) {
  return false;
}
```

For VARYING_LAST_DIM and VARYING_BOTH_DIMS shapes, omitting this check could allow stale or padding blocks (within tensor_end_offset but beyond the actual rows × cols footprint) to issue TMA loads from out-of-bounds addresses. Please consider adding the equivalent bounds check.
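As a self-contained illustration of the requested fix, here is a host-side C++ model of is_job_valid with the MXFP8-style intra-tensor check added. The struct fields mirror the names in the snippets above, but the offsets are modeled as plain values rather than a device-side offsets_ptr array, so this is a sketch of the logic, not the actual device code:

```cpp
#include <cstddef>

// Host-side model of a scheduled block job. In the real kernel,
// tensor_start_offset / tensor_end_offset come from offsets_ptr[tensor_id]
// and offsets_ptr[tensor_id + 1] on the device.
struct Job {
  std::size_t tensor_start_offset;
  std::size_t tensor_end_offset;
  std::size_t block_global_offset;
  std::size_t rows;
  std::size_t cols;
};

bool is_job_valid(const Job &job) {
  // Existing check: block must start before the tensor's end offset.
  if (job.block_global_offset >= job.tensor_end_offset) {
    return false;
  }
  // Proposed extra check: reject blocks that sit inside the padded
  // [start, end) range but beyond the tensor's rows x cols footprint.
  const std::size_t tensor_offset_from_start =
      job.block_global_offset - job.tensor_start_offset;
  const std::size_t block_offset_Y_in_tensor = tensor_offset_from_start / job.cols;
  if (block_offset_Y_in_tensor >= job.rows) {
    return false;
  }
  return true;
}
```

With rows = 2 and cols = 128, a block at global offset 512 inside a padded [0, 1024) range passes the end-offset check but lands at row 4, which the extra check rejects.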
```cpp
std::vector<fp4e2m1> out_data_rowwise_h(total_elts / 2);
std::vector<fp4e2m1> out_data_colwise_h(total_elts / 2);
std::vector<fp8e4m3> out_scales_rowwise_h(rowwise_scales_num);
std::vector<fp8e4m3> out_scales_colwise_h(colwise_scales_num);
```

Wrong variable used in "more mismatches" condition

mismatch_messages is only appended to while total_mismatches <= max_mismatches_to_print (3), so its size() can never exceed max_mismatches_to_print. The condition is therefore always false and the "... and X more mismatches" line is dead code, even inside the print_detailed_summary branch. The comparison should use total_mismatches:

```cpp
if (total_mismatches > max_mismatches_to_print) {
```
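To see why the counter, not the vector size, must drive the trailing line, here is a minimal stand-alone model of the reporting logic (the function and variable names here are illustrative, not the test's actual identifiers):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Collect at most `cap` detailed messages, but count every mismatch; the
// trailing summary line must key off the full counter, because the message
// vector is capped by construction and can never exceed `cap`.
std::string summarize(const std::vector<bool> &results, std::size_t cap) {
  std::vector<std::string> messages;
  std::size_t total_mismatches = 0;
  for (std::size_t i = 0; i < results.size(); ++i) {
    if (!results[i]) {
      ++total_mismatches;
      if (total_mismatches <= cap) {
        messages.push_back("mismatch at " + std::to_string(i));
      }
    }
  }
  std::string out;
  for (const auto &m : messages) out += m + "\n";
  // Correct condition: total_mismatches, not messages.size().
  if (total_mismatches > cap) {
    out += "... and " + std::to_string(total_mismatches - cap) + " more mismatches\n";
  }
  return out;
}
```

With five mismatches and a cap of three, the summary correctly reports two more; with `messages.size() > cap` it never would.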
```cpp
cudaMemcpy(last_dims_d, last_dims_h.data(), num_tensors * sizeof(int64_t), cudaMemcpyHostToDevice);
cudaMemcpy(offsets_d, offsets_h.data(), (num_tensors + 1) * sizeof(int64_t), cudaMemcpyHostToDevice);

cudaMemset(out_data_rowwise_d, 0, out_data_size);
cudaMemset(out_data_colwise_d, 0, out_data_size);
cudaMemset(out_scales_rowwise_d, 0, rowwise_scales_size);
cudaMemset(out_scales_colwise_d, 0, colwise_scales_size);

NVTEShape logical_shape_ = nvte_make_shape(logical_shape.data(), logical_shape.size());
```

CUDA API return values are not checked

All cudaMalloc, cudaMemcpy, and cudaMemset calls in performTest silently ignore their return values. A failed allocation would leave the pointer uninitialized (or null), and the test would proceed, likely crashing or producing a spurious cudaGetLastError failure that obscures the real problem.

Consider wrapping the calls with a helper that asserts success, e.g.:

```cpp
ASSERT_EQ(cudaMalloc((void**)&in_data_d, in_data_size), cudaSuccess);
ASSERT_EQ(cudaMemcpy(in_data_d, grouped_input.data(), in_data_size, cudaMemcpyHostToDevice), cudaSuccess);
```

This pattern applies to all CUDA API calls from the allocation block down through the cudaMemset calls (lines 388–410).
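One possible shape for such a helper is sketched below. The CUDA runtime types are stubbed here so the sketch compiles without the toolkit; in the actual test they would come from <cuda_runtime.h>, and the helper would typically be wrapped in a gtest ASSERT_TRUE:

```cpp
#include <cstdio>

// Stubs standing in for the CUDA runtime; the real definitions live in
// <cuda_runtime.h>. Only two error codes are modeled for illustration.
enum cudaError_t { cudaSuccess = 0, cudaErrorMemoryAllocation = 2 };
inline const char *cudaGetErrorString(cudaError_t err) {
  return err == cudaSuccess ? "no error" : "out of memory";
}

// Returns true on success and logs the failing call otherwise, so callers
// can write ASSERT_TRUE(check_cuda(cudaMalloc(...), "cudaMalloc")) and get
// a pointer to the real failure instead of a later spurious error.
inline bool check_cuda(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}
```

A single helper keeps the allocation block readable while still failing fast at the first CUDA error.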
```cpp
 * \param[in]  stream  CUDA stream used for the operation.
 */
```

Binary-incompatible API change without a breaking-change marker

nvte_group_quantize_dbias (and the five related nvte_group_quantize_dbias_d* functions) previously accepted NVTETensor dbias; this PR changes the parameter to NVTEGroupedTensor dbias. Any existing C/C++ caller compiled against the old header will silently pass the wrong type at runtime. Even though both types are opaque pointers at the ABI level, callers that stored the dbias as an NVTETensor will need to migrate.

The PR description marks this as "New feature (non-breaking change)", but this signature change will break downstream callers (Python bindings, external C++ users) that previously compiled against NVTETensor dbias. It is worth auditing all internal call sites (the Python pybind layer, etc.) and explicitly documenting the migration in the PR / changelog.
Force-pushed from 97ec071 to fef9220
transformer_engine/common/cast/nvfp4/specialized/group_quantize_transpose_nvfp4_tuned_1D.cuh
Force-pushed from 2e289c9 to 9e37b4c
```cpp
const size_t rows = tensor_rows / chunk_dim_Y;
const size_t cols = last_logical_dim;

const size_t dbias_in_offset_Y =
    (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS)
        ? (tensor_id * (tensor_rows / chunk_dim_Y))
        : (static_cast<size_t>(offsets_ptr[tensor_id]) / cols / chunk_dim_Y);

const size_t thread_id = blockIdx.x * blockDim.x + threadIdx.x;

if (thread_id * nvec >= cols) {
```

group_reduce_dbias_kernel uses last_logical_dim as cols for all shape representations

cols is unconditionally set to last_logical_dim:

```cpp
const size_t cols = last_logical_dim;
```

For VARYING_LAST_DIM and VARYING_BOTH_DIMS shapes, each tensor has a different last dimension. Using the scalar last_logical_dim for all tensors will produce incorrect partial-dbias strides and wrong output write offsets (tensor_id * cols assumes uniform column counts). The same issue affects the dbias_in_offset_Y calculation for those shape representations.

cast.h documents that "Grouped dbias is not yet supported for grouped tensors with a varying last dimension," but there is no runtime guard in grouped_reduce_dbias or this kernel to enforce that. If called with such shapes, the function silently corrupts memory. Consider adding an explicit NVTE_CHECK(shape_rep != ShapeRepresentation::VARYING_LAST_DIM && shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS, ...) guard in grouped_reduce_dbias before the kernel launch.
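A stand-alone sketch of the proposed guard condition follows. The ShapeRepresentation values are taken from the snippets in this PR; grouped_dbias_supported is a hypothetical helper name, and the real code would call NVTE_CHECK directly in grouped_reduce_dbias:

```cpp
// Shape representations as used in the snippets above.
enum class ShapeRepresentation {
  SAME_BOTH_DIMS,
  VARYING_FIRST_DIM,
  VARYING_LAST_DIM,
  VARYING_BOTH_DIMS
};

// Hypothetical predicate capturing the cast.h restriction: grouped dbias is
// not yet supported when the last dimension varies across tensors, because
// the kernel assumes a uniform column count (last_logical_dim).
inline bool grouped_dbias_supported(ShapeRepresentation shape_rep) {
  return shape_rep != ShapeRepresentation::VARYING_LAST_DIM &&
         shape_rep != ShapeRepresentation::VARYING_BOTH_DIMS;
}
```

Rejecting the unsupported shapes on the host turns silent memory corruption into an immediate, diagnosable error.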
Force-pushed from f6b5928 to 6a7409d
Force-pushed from 50a4921 to 6c5cc7f
Force-pushed from c83b558 to f5e2ba0
```diff
 void nvte_group_quantize_dbias(const NVTEGroupedTensor input, NVTEGroupedTensor output,
-                               NVTETensor dbias, NVTETensor workspace, cudaStream_t stream);
+                               NVTEGroupedTensor dbias, NVTETensor workspace, cudaStream_t stream);
```

The six nvte_group_quantize_dbias* functions (this one and nvte_group_quantize_dbias_dgelu, nvte_group_quantize_dbias_dsilu, nvte_group_quantize_dbias_drelu) now take NVTEGroupedTensor dbias instead of NVTETensor dbias. This is a signature change that may affect existing C/C++ callers compiled against the old header. While this appears intentional as part of the grouped tensor API consolidation, consider auditing Python bindings and any external C++ code to ensure compatibility, and explicitly document the migration path in the changelog.
```cpp
const bool use_single_work_grid = (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS ||
                                   shape_rep == ShapeRepresentation::VARYING_FIRST_DIM);

const size_t first_logical_dim = input->logical_shape.data[0];
const size_t last_logical_dim = input->logical_shape.data[1];
const size_t elts_total = first_logical_dim * last_logical_dim;
const size_t num_tensors = input->num_tensors;

NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
           "Number of tensors in a group is larger than the MAX number of supported "
           "descriptors (64).");
```

For the SAME_BOTH_DIMS case, the kernel computes per-tensor row counts via integer division (first_logical_dim / num_tensors), which silently truncates if first_logical_dim is not exactly divisible by num_tensors. This causes incorrect base offsets and may skip or overwrite the last few rows. Add a host-side check:

```cpp
if (shape_rep == ShapeRepresentation::SAME_BOTH_DIMS) {
  NVTE_CHECK(first_logical_dim % num_tensors == 0,
             "For SAME_BOTH_DIMS, first_logical_dim (", first_logical_dim,
             ") must be divisible by num_tensors (", num_tensors, ").");
}
```
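A quick numeric illustration of the truncation hazard, written as plain host code with illustrative helper names:

```cpp
#include <cstddef>

// With a non-divisible first_logical_dim, integer division silently drops
// the remainder rows; e.g. 130 rows across 4 tensors yields 32 rows each,
// leaving 2 rows unaccounted for. The proposed check rejects exactly this.
inline std::size_t rows_per_tensor(std::size_t first_logical_dim,
                                   std::size_t num_tensors) {
  return first_logical_dim / num_tensors;  // truncates toward zero
}

inline bool same_both_dims_ok(std::size_t first_logical_dim,
                              std::size_t num_tensors) {
  return first_logical_dim % num_tensors == 0;
}
```

Catching the remainder on the host is cheap, whereas the truncated per-tensor row count corrupts every subsequent base-offset computation in the kernel.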
Force-pushed from 811a146 to eace4a6
Description
This PR adds a persistent grouped NVFP4 quantization + transpose kernel with static scheduling. It is built on top of PR #2738, [Common] Persistent Grouped MXFP8 quantization kernel.
Type of change
Changes
Checklist: