
[PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error reporting #2802

Merged
timmoon10 merged 4 commits into NVIDIA:main from
sudhakarsingh27:sudhakars/improve-distributed-test-errors
Apr 3, 2026

Conversation

@sudhakarsingh27
Collaborator


Distributed tests launch subprocesses via torch.distributed.launch/torchrun. When these fail, pytest only captures the CalledProcessError from the parent process, not the actual worker traceback. This makes CI JUnit XML reports show "exit code 1" with no useful error detail.

Add run_distributed() utility to tests/pytorch/utils.py that captures stderr while letting stdout stream to the terminal. On failure, the worker's stderr (containing the actual Python traceback) is included in the AssertionError, which pytest writes into the JUnit XML report.

Behavior:

  • Interactive use: stdout streams in real time (unchanged), stderr shown on failure
  • CI/JUnit XML: failure reports now include the actual worker traceback
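The described behavior can be sketched as follows. This is a minimal illustration based on the PR description, not the exact implementation in tests/pytorch/utils.py; the truncation length and message wording are assumptions.

```python
import subprocess


def run_distributed(args, valid_returncodes=(0,), **kwargs):
    """Run a distributed test command with CI-friendly error reporting.

    stdout is inherited, so it streams to the terminal in real time;
    stderr is captured so the worker traceback can be attached to the
    failure message that pytest writes into the JUnit XML report.
    """
    result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)
    if result.returncode not in valid_returncodes:
        # Keep only the tail so very long logs don't bloat the XML report.
        tail = (result.stderr or "")[-4000:]
        raise AssertionError(
            f"Command {args} exited with code {result.returncode}.\n"
            f"Last 4000 chars of stderr:\n{tail}"
        )
    return result
```

An inner pytest run that may legitimately skip everything can pass valid_returncodes=(0, 5), since pytest exits with code 5 when no tests were collected or all were skipped.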

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Add --output-junit flag so ctest writes JUnit XML to /logs/,
matching the pattern used by pytest tests. The XML is written
before ctest exits, so it's captured even on test failure.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27 sudhakarsingh27 marked this pull request as ready for review March 27, 2026 07:51
@greptile-apps
Contributor

greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR improves CI error reporting for distributed PyTorch tests by introducing a run_distributed() helper in tests/pytorch/utils.py that captures subprocess stderr and includes it in AssertionError messages (and therefore JUnit XML reports) on failure, while leaving stdout to stream live to the terminal. Six test files are migrated from raw subprocess.run(..., check=True) calls to this new utility, and qa/L0_cppunittest/test.sh gains JUnit XML output for C++ unit tests.

Key changes:

  • tests/pytorch/utils.py: New run_distributed() wraps subprocess.run with stderr=subprocess.PIPE, text=True; on non-zero exit it raises AssertionError with up to 4000 chars of the stderr tail, making worker tracebacks visible in JUnit XML.
  • valid_returncodes parameter: Callers that run inner pytest processes (e.g. FSDP2 tests) can pass valid_returncodes=(0, 5) to treat pytest's "all-skipped" exit code as success.
  • qa/L0_cppunittest/test.sh: Adds --output-junit to ctest so C++ unit test results also appear in JUnit XML reports.
  • One unused import subprocess remains in test_torch_fsdp2.py after the migration and should be cleaned up.

Confidence Score: 5/5

Safe to merge — all remaining findings are minor style issues that do not affect correctness or CI reliability.

The core utility is well-designed: stderr is captured for reporting, stdout streams live, truncation logic is correct, and valid_returncodes handles the inner-pytest exit-5 case cleanly. The only issue found is an unused import subprocess left in test_torch_fsdp2.py, a P2 style issue only.

tests/pytorch/distributed/test_torch_fsdp2.py — unused import subprocess should be removed.

Important Files Changed

  • tests/pytorch/utils.py: Adds run_distributed() helper that captures stderr via subprocess.PIPE while leaving stdout to stream live; raises AssertionError with a 4000-char stderr tail on failure.
  • tests/pytorch/distributed/test_torch_fsdp2.py: Migrates two subprocess.run calls to run_distributed with valid_returncodes=(0, 5); leaves an unused import subprocess that should be removed.
  • tests/pytorch/attention/test_attention_with_cp.py: Replaces two subprocess.run(..., check=True) calls with run_distributed(); no functional changes to test logic.
  • tests/pytorch/distributed/test_cast_master_weights_to_fp8.py: Adds sys.path manipulation and run_distributed import; replaces subprocess.run(command, check=True) with run_distributed(command).
  • tests/pytorch/distributed/test_fusible_ops_with_userbuffers.py: Adds run_distributed to imports; replaces subprocess.run(..., check=True, env=env) with run_distributed(command, env=env).
  • qa/L0_cppunittest/test.sh: Adds XML_LOG_DIR setup and passes --output-junit to ctest for JUnit XML reporting in CI.

Sequence Diagram

sequenceDiagram
    participant PT as pytest (parent)
    participant RD as run_distributed()
    participant SP as subprocess.run
    participant WK as torchrun / worker processes

    PT->>RD: call run_distributed(args, **kwargs)
    RD->>SP: subprocess.run(args, stderr=PIPE, text=True, **kwargs)
    SP->>WK: launch torchrun / distributed workers
    WK-->>SP: stdout → inherited (streams to terminal)
    WK-->>SP: stderr → PIPE (buffered in memory)
    SP-->>RD: CompletedProcess(returncode, stderr)

    alt returncode in valid_returncodes
        RD-->>PT: return CompletedProcess (success)
    else returncode not in valid_returncodes
        RD->>RD: build msg with last 4000 chars of stderr
        RD-->>PT: raise AssertionError(msg)
        PT->>PT: write AssertionError + stderr tail to JUnit XML
    end


Context (tests/pytorch/utils.py, from the diff under review):

        Use (0, 5) for inner pytest runs where 5 means all tests skipped.
        **kwargs: Passed through to subprocess.run (e.g. env, timeout).
    """
    result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)

P2 **kwargs can silently conflict with stderr and text

If a caller ever passes stderr= or text= through **kwargs, Python will raise TypeError: subprocess.run() got multiple values for keyword argument 'stderr'. Consider explicitly popping or blocking those keys, or documenting the restriction:

kwargs.pop("stderr", None)  # always captured internally
kwargs.pop("text", None)    # always text mode internally
result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)

None of the current call sites pass these, so this is not an immediate bug — just a fragile API surface.
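To make the conflict concrete: since the helper always sets stderr and text itself, forwarding either through **kwargs makes subprocess.run() raise TypeError. A hardened variant (hypothetical name run_distributed_safe, illustrating only the kwargs handling, not the full helper) could drop those keys before delegating:

```python
import subprocess


def run_distributed_safe(args, **kwargs):
    # These keys are always set internally; silently drop caller-supplied
    # values instead of letting subprocess.run() raise
    # "got multiple values for keyword argument 'stderr'".
    kwargs.pop("stderr", None)
    kwargs.pop("text", None)
    return subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)
```

Documenting the restriction in the docstring would also be enough, given that no current call site passes these keys.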

@timmoon10
Collaborator

/te-ci pytorch L1

Collaborator

@timmoon10 timmoon10 left a comment


LGTM

@timmoon10 timmoon10 merged commit a88fdc1 into NVIDIA:main Apr 3, 2026
27 of 30 checks passed