
[PyTorch Debug] Support tensor dump #2645

Open
pggPL wants to merge 18 commits into NVIDIA:main from pggPL:inpsect_tensor_dump_support

Conversation


@pggPL pggPL commented Feb 3, 2026

Description

This PR introduces a new debug feature focused on offline analysis of tensors.
The motivation is to make it easier to inspect and analyze intermediate tensors outside of runtime, especially during quantization debugging.

The new DumpTensors feature allows saving:

  • high-precision tensors (before quantization),
  • quantized tensors (after quantization),
  • optional quantization internals (e.g. decoded data/scales, and amax for NVFP4).

An important piece of context: some quantization metadata (notably scale-related values in the NVFP4 paths) is stored in compact, performance-oriented, uint8-backed formats, which are efficient at runtime but hard to analyze directly.
To improve offline usability, the dump path converts these values to appropriate floating-point dtypes for easier interpretation.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added new debug feature: transformer_engine.debug.features.dump_tensors.DumpTensors.
  • Added support for dumping high-precision and quantized tensors via inspect_tensor.
  • Added optional dumping of quantized internals (dump_quantized_internals) for FP8/FP8-blockwise/MXFP8/NVFP4 tensor types.
  • Added conversion of scale-related internals to float dtypes for better offline analysis (including NVFP4-related fields).
  • Added/updated tests in tests/pytorch/debug/test_log.py for DumpTensors sanity flow.
  • Updated debug documentation/API listing to include DumpTensors in docs/debug/3_api_features.rst.
  • Fixed robustness issues found in review:
    • logger re-initialization across debug sessions,
    • dump test validation timing (before temp directory cleanup).

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 2 commits February 3, 2026 08:54
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL changed the title [Debug] Support tensor dump [PyTorch Debug] Support tensor dump Feb 3, 2026
@pggPL pggPL marked this pull request as ready for review March 5, 2026 10:44

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR introduces a new DumpTensors debug feature for offline tensor analysis, allowing users to save pre-quantization and post-quantization tensors (plus their internals) to disk via torch.save. It addresses several issues from a prior review round (missing filepath variable, type annotation mismatches, silent failures, missing log messages) and adds unit tests for FP8 and NVFP4 tensor types.

Key changes:

  • New transformer_engine/debug/features/dump_tensors.py with TensorLogger singleton, DumpTensors feature class, and per-type helper functions for FP8, FP8-blockwise, MXFP8, and NVFP4 internals extraction.
  • NVFP4 scales are reinterpreted as float8_e4m3fn, MXFP8 scales as float8_e8m0fnu, and packed NVFP4 data is unpacked + decoded to float32 for easier offline use.
  • Two new pytest tests covering FP8 sanity and NVFP4 unpacked-value reconstruction validation.
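The "unpack + decode" step for NVFP4 data can be sketched in plain Python. NVFP4 packs two 4-bit E2M1 codes per byte; the decode table below follows the standard E2M1 value set (±0, 0.5, 1, 1.5, 2, 3, 4, 6). The nibble ordering (low nibble first) is an assumption for illustration, not a claim about the PR's helper:

```python
# E2M1 (FP4) magnitudes for the 8 non-negative codes; bit 3 is the sign.
_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code: int) -> float:
    """Decode a single 4-bit E2M1 code to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * _E2M1_MAGNITUDES[code & 0x7]


def unpack_fp4_byte(byte: int) -> tuple[float, float]:
    """Unpack one uint8 into its two FP4 values (low nibble assumed first)."""
    return decode_fp4(byte & 0x0F), decode_fp4(byte >> 4)


print(unpack_fp4_byte(0x2F))  # (-6.0, 1.0)
```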

Critical concern: _get_extended_tensors_mxfp8 and _get_extended_tensors_nvfp4 save _rowwise_scale_inv / _columnwise_scale_inv verbatim, but in real training optimize_for_gemm=True (set by basic_linear.py) causes these tensors to be in a swizzled tile layout. The dump then contains scales that cannot be naively expanded (repeat_interleave) to reconstruct values — the primary offline-analysis use-case the feature advertises. Neither a warning nor unswizzling logic is present. The tests pass because they use RecipeState.create directly, which defaults to optimize_for_gemm=False.
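To make the concern concrete, here is the naive expansion an offline user would attempt, assuming a plain row-major scale layout with one scale per 1x32 block (shapes are illustrative). This is exactly the step that silently produces wrong values if the dumped scales are actually in a swizzled GEMM-friendly layout:

```python
import torch

# Naive block-scale dequantization under the row-major assumption: each 1x32
# block of codes shares one scale, so repeat_interleave expands the scale
# tensor to per-element granularity. With swizzled scales, this pairing of
# scales to blocks is wrong even though every shape still matches.
codes = torch.randn(4, 64)            # stand-in for decoded quantized data
block_scales = torch.rand(4, 2) + 0.5  # one scale per 1x32 block

per_element_scales = block_scales.repeat_interleave(32, dim=1)  # (4, 64)
dequantized = codes * per_element_scales
```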

Confidence Score: 3/5

  • Safe to merge as a debug/non-breaking feature, but the swizzled-scale gap makes it misleading for the most common production training scenario.
  • The feature works correctly for the test cases (non-swizzled scales, direct RecipeState construction) and all previously flagged runtime bugs have been fixed. However, production linear-layer training sets optimize_for_gemm=True, which puts MXFP8 and NVFP4 scales in a swizzled layout that the dump stores without any indication, making the advertised offline reconstruction workflow incorrect in that scenario. Tests don't cover swizzled scale paths.
  • transformer_engine/debug/features/dump_tensors.py — specifically _get_extended_tensors_mxfp8 and _get_extended_tensors_nvfp4 for the swizzled scale handling gap.

Important Files Changed

  • transformer_engine/debug/features/dump_tensors.py: New DumpTensors feature. Core logic is sound and previous review issues (filepath NameError, missing log messages, type annotations) are fixed, but the swizzled-scale layout for MXFP8/NVFP4 in production training is unhandled and could produce misleading offline-analysis data.
  • tests/pytorch/debug/test_log.py: Two new tests, a basic FP8 sanity check and an NVFP4 unpacked-codes reconstruction test. They use the module-level RecipeState import correctly and validate file creation, dict structure, dtype, shape, and dequantization fidelity, but only cover non-swizzled scale cases.
  • transformer_engine/debug/features/log_fp8_tensor_stats.py: Minor import reordering only; moves the transformer_engine_torch as tex import below the nvdlfw_inspect imports. No logic change.
  • docs/debug/3_api_features.rst: Documentation-only change adding the new DumpTensors autoapiclass entry; the missing newline at end of file is a pre-existing pattern in the file.

Sequence Diagram

sequenceDiagram
    participant User
    participant debug_api
    participant DumpTensors
    participant TensorLogger
    participant _get_quantized_internals
    participant Disk

    User->>debug_api: inspect_tensor(layer_name, tensor_name, iteration, tensor, rowwise_qt, columnwise_qt)
    debug_api->>DumpTensors: inspect_tensor(config, ...)
    DumpTensors->>DumpTensors: validate rowwise == columnwise (or one is None)
    DumpTensors->>DumpTensors: resolve quantized_tensor (rowwise ?? columnwise)
    DumpTensors->>TensorLogger: ensure_initialized(root_log_dir)
    TensorLogger-->>DumpTensors: ready

    alt dump_hp=True and tensor not None
        DumpTensors->>DumpTensors: dump_dict["high_precision"] = tensor
    end

    alt dump_quant=True and quantized_tensor not None
        DumpTensors->>DumpTensors: dump_dict["quantized"] = quantized_tensor
        opt dump_quantized_internals=True
            DumpTensors->>_get_quantized_internals: _get_quantized_internals(quantized_tensor)
            note over _get_quantized_internals: dispatch by type:<br/>Float8Tensor → data, scale_inv<br/>Float8BlockwiseQTensor → rowwise/columnwise data+scales<br/>MXFP8Tensor → data + float8_e8m0fnu scales<br/>NVFP4Tensor → packed data + unpacked FP4 values + float8_e4m3fn scales + amax
            _get_quantized_internals-->>DumpTensors: internals dict
            DumpTensors->>DumpTensors: dump_dict.update(internals)
        end
    end

    DumpTensors->>TensorLogger: save_tensor(dump_dict, layer_name, tensor_name, iteration)
    TensorLogger->>TensorLogger: sanitize names, build filepath
    TensorLogger->>Disk: torch.save(dump_dict, "{layer}_{tensor}_iter_{iter:06d}.pt")
    Disk-->>TensorLogger: saved
    TensorLogger-->>DumpTensors: done
    DumpTensors-->>debug_api: log success message
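The "sanitize names, build filepath" step in the diagram above can be sketched as follows. This is a hypothetical helper written from the diagram, not the PR's TensorLogger; the sanitization rule is an assumption:

```python
import os
import re
import tempfile

import torch


def save_tensor_dump(dump_dict, log_dir, layer_name, tensor_name, iteration):
    """Sketch of the save step: sanitize names, build the filepath, save."""

    def _sanitize(name: str) -> str:
        # Replace path separators and other filename-unsafe characters.
        return re.sub(r"[^\w.\-]", "_", name)

    filename = (
        f"{_sanitize(layer_name)}_{_sanitize(tensor_name)}"
        f"_iter_{iteration:06d}.pt"
    )
    path = os.path.join(log_dir, filename)
    torch.save(dump_dict, path)
    return path


with tempfile.TemporaryDirectory() as d:
    p = save_tensor_dump({"quantized": torch.zeros(2)}, d, "layers/0", "fc1_weight", 7)
    print(os.path.basename(p))  # layers_0_fc1_weight_iter_000007.pt
```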

Last reviewed commit: b78d36f

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
pggPL and others added 4 commits March 5, 2026 10:57
Signed-off-by: root <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
pggPL and others added 3 commits March 5, 2026 13:13
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
Signed-off-by: root <[email protected]>
pggPL and others added 2 commits March 5, 2026 14:03

pggPL commented Mar 5, 2026

/te-ci pytorch
