
[PyTorch Debug] Support tensor dump #2645

Open
pggPL wants to merge 18 commits into NVIDIA:main from pggPL:inpsect_tensor_dump_support

Conversation


@pggPL pggPL commented Feb 3, 2026

Description

This PR introduces a new debug feature focused on offline analysis of tensors.
The motivation is to make it easier to inspect and analyze intermediate tensors outside of runtime, especially during quantization debugging.

The new DumpTensors feature allows saving:

  • high-precision tensors (before quantization),
  • quantized tensors (after quantization),
  • optional quantization internals (e.g. decoded data/scales, and amax for NVFP4).

An important piece of context: some quantization metadata (notably scale-related values in the NVFP4 paths) is stored in compact, performance-oriented, uint8-backed formats, which are efficient at runtime but hard to analyze directly.
To improve offline usability, the dump path converts these values to appropriate floating-point dtypes for easier interpretation.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added new debug feature: transformer_engine.debug.features.dump_tensors.DumpTensors.
  • Added support for dumping high-precision and quantized tensors via inspect_tensor.
  • Added optional dumping of quantized internals (dump_quantized_internals) for FP8/FP8-blockwise/MXFP8/NVFP4 tensor types.
  • Added conversion of scale-related internals to float dtypes for better offline analysis (including NVFP4-related fields).
  • Added/updated tests in tests/pytorch/debug/test_log.py for DumpTensors sanity flow.
  • Updated debug documentation/API listing to include DumpTensors in docs/debug/3_api_features.rst.
  • Fixed robustness issues found in review:
    • logger re-initialization across debug sessions,
    • dump test validation timing (before temp directory cleanup).

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 2 commits February 3, 2026 08:54
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@pggPL pggPL changed the title [Debug] Support tensor dump [PyTorch Debug] Support tensor dump Feb 3, 2026
@pggPL pggPL marked this pull request as ready for review March 5, 2026 10:44

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR introduces a new DumpTensors debug feature for offline tensor analysis, allowing users to save pre-quantization and post-quantization tensors (plus their internals) to disk via torch.save. It addresses several issues from a prior review round (missing filepath variable, type annotation mismatches, silent failures, missing log messages) and adds unit tests for FP8 and NVFP4 tensor types.

Key changes:

  • New transformer_engine/debug/features/dump_tensors.py with TensorLogger singleton, DumpTensors feature class, and per-type helper functions for FP8, FP8-blockwise, MXFP8, and NVFP4 internals extraction.
  • NVFP4 scales are reinterpreted as float8_e4m3fn, MXFP8 scales as float8_e8m0fnu, and packed NVFP4 data is unpacked + decoded to float32 for easier offline use.
  • Two new pytest tests covering FP8 sanity and NVFP4 unpacked-value reconstruction validation.
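The "unpack + decode" step for NVFP4 data can be sketched in plain Python. NVFP4 packs two 4-bit E2M1 codes per byte; the decode table below follows the standard E2M1 value set (±0, 0.5, 1, 1.5, 2, 3, 4, 6). The nibble ordering (low nibble first) is an assumption for illustration, not a claim about the PR's helper:

```python
# E2M1 (FP4) magnitudes for the 8 non-negative codes; bit 3 is the sign.
_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code: int) -> float:
    """Decode a single 4-bit E2M1 code to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * _E2M1_MAGNITUDES[code & 0x7]


def unpack_fp4_byte(byte: int) -> tuple[float, float]:
    """Unpack one uint8 into its two FP4 values (low nibble assumed first)."""
    return decode_fp4(byte & 0x0F), decode_fp4(byte >> 4)


print(unpack_fp4_byte(0x2F))  # (-6.0, 1.0)
```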

Critical concern: _get_extended_tensors_mxfp8 and _get_extended_tensors_nvfp4 save _rowwise_scale_inv / _columnwise_scale_inv verbatim, but in real training optimize_for_gemm=True (set by basic_linear.py) causes these tensors to be in a swizzled tile layout. The dump then contains scales that cannot be naively expanded (repeat_interleave) to reconstruct values — the primary offline-analysis use-case the feature advertises. Neither a warning nor unswizzling logic is present. The tests pass because they use RecipeState.create directly, which defaults to optimize_for_gemm=False.
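To make the concern concrete, here is the naive expansion an offline user would attempt, assuming a plain row-major scale layout with one scale per 1x32 block (shapes are illustrative). This is exactly the step that silently produces wrong values if the dumped scales are actually in a swizzled GEMM-friendly layout:

```python
import torch

# Naive block-scale dequantization under the row-major assumption: each 1x32
# block of codes shares one scale, so repeat_interleave expands the scale
# tensor to per-element granularity. With swizzled scales, this pairing of
# scales to blocks is wrong even though every shape still matches.
codes = torch.randn(4, 64)            # stand-in for decoded quantized data
block_scales = torch.rand(4, 2) + 0.5  # one scale per 1x32 block

per_element_scales = block_scales.repeat_interleave(32, dim=1)  # (4, 64)
dequantized = codes * per_element_scales
```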

Confidence Score: 3/5

  • Safe to merge as a debug/non-breaking feature, but the swizzled-scale gap makes it misleading for the most common production training scenario.
  • The feature works correctly for the test cases (non-swizzled scales, direct RecipeState construction) and all previously flagged runtime bugs have been fixed. However, production linear-layer training sets optimize_for_gemm=True, which puts MXFP8 and NVFP4 scales in a swizzled layout that the dump stores without any indication, making the advertised offline reconstruction workflow incorrect in that scenario. Tests don't cover swizzled scale paths.
  • transformer_engine/debug/features/dump_tensors.py — specifically _get_extended_tensors_mxfp8 and _get_extended_tensors_nvfp4 for the swizzled scale handling gap.

Important Files Changed

  • transformer_engine/debug/features/dump_tensors.py: New DumpTensors feature. Core logic is sound and previous review issues (filepath NameError, missing log messages, type annotations) are fixed, but the swizzled-scale layout for MXFP8/NVFP4 in production training is unhandled and could produce misleading offline-analysis data.
  • tests/pytorch/debug/test_log.py: Two new tests, a basic FP8 sanity check and an NVFP4 unpacked-codes reconstruction test. They use the module-level RecipeState import correctly and validate file creation, dict structure, dtype, shape, and dequantization fidelity, but only cover non-swizzled scale cases.
  • transformer_engine/debug/features/log_fp8_tensor_stats.py: Minor import reordering only; moves the transformer_engine_torch as tex import below the nvdlfw_inspect imports. No logic change.
  • docs/debug/3_api_features.rst: Documentation-only change adding the new DumpTensors autoapiclass entry; the missing newline at end of file is a pre-existing pattern in the file.

Sequence Diagram

sequenceDiagram
    participant User
    participant debug_api
    participant DumpTensors
    participant TensorLogger
    participant _get_quantized_internals
    participant Disk

    User->>debug_api: inspect_tensor(layer_name, tensor_name, iteration, tensor, rowwise_qt, columnwise_qt)
    debug_api->>DumpTensors: inspect_tensor(config, ...)
    DumpTensors->>DumpTensors: validate rowwise == columnwise (or one is None)
    DumpTensors->>DumpTensors: resolve quantized_tensor (rowwise ?? columnwise)
    DumpTensors->>TensorLogger: ensure_initialized(root_log_dir)
    TensorLogger-->>DumpTensors: ready

    alt dump_hp=True and tensor not None
        DumpTensors->>DumpTensors: dump_dict["high_precision"] = tensor
    end

    alt dump_quant=True and quantized_tensor not None
        DumpTensors->>DumpTensors: dump_dict["quantized"] = quantized_tensor
        opt dump_quantized_internals=True
            DumpTensors->>_get_quantized_internals: _get_quantized_internals(quantized_tensor)
            note over _get_quantized_internals: dispatch by type:<br/>Float8Tensor → data, scale_inv<br/>Float8BlockwiseQTensor → rowwise/columnwise data+scales<br/>MXFP8Tensor → data + float8_e8m0fnu scales<br/>NVFP4Tensor → packed data + unpacked FP4 values + float8_e4m3fn scales + amax
            _get_quantized_internals-->>DumpTensors: internals dict
            DumpTensors->>DumpTensors: dump_dict.update(internals)
        end
    end

    DumpTensors->>TensorLogger: save_tensor(dump_dict, layer_name, tensor_name, iteration)
    TensorLogger->>TensorLogger: sanitize names, build filepath
    TensorLogger->>Disk: torch.save(dump_dict, "{layer}_{tensor}_iter_{iter:06d}.pt")
    Disk-->>TensorLogger: saved
    TensorLogger-->>DumpTensors: done
    DumpTensors-->>debug_api: log success message
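The "sanitize names, build filepath" step in the diagram above can be sketched as follows. This is a hypothetical helper written from the diagram, not the PR's TensorLogger; the sanitization rule is an assumption:

```python
import os
import re
import tempfile

import torch


def save_tensor_dump(dump_dict, log_dir, layer_name, tensor_name, iteration):
    """Sketch of the save step: sanitize names, build the filepath, save."""

    def _sanitize(name: str) -> str:
        # Replace path separators and other filename-unsafe characters.
        return re.sub(r"[^\w.\-]", "_", name)

    filename = (
        f"{_sanitize(layer_name)}_{_sanitize(tensor_name)}"
        f"_iter_{iteration:06d}.pt"
    )
    path = os.path.join(log_dir, filename)
    torch.save(dump_dict, path)
    return path


with tempfile.TemporaryDirectory() as d:
    p = save_tensor_dump({"quantized": torch.zeros(2)}, d, "layers/0", "fc1_weight", 7)
    print(os.path.basename(p))  # layers_0_fc1_weight_iter_000007.pt
```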

Last reviewed commit: b78d36f

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
pggPL and others added 4 commits March 5, 2026 10:57
Signed-off-by: root <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
pggPL and others added 3 commits March 5, 2026 13:13
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <[email protected]>
Signed-off-by: root <[email protected]>
pggPL and others added 2 commits March 5, 2026 14:03

pggPL commented Mar 5, 2026

/te-ci pytorch
