Skip to content

[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI#5822

Draft
pv-nvidia wants to merge 9 commits into
isaac-sim:developfrom
pv-nvidia:pv/fabric-mgpu-ci-stress
Draft

[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI#5822
pv-nvidia wants to merge 9 commits into
isaac-sim:developfrom
pv-nvidia:pv/fabric-mgpu-ci-stress

Conversation

@pv-nvidia
Copy link
Copy Markdown
Contributor

1. Summary

2. Status

Draft / stress-test PR. This is intended to reproduce and monitor the FabricFrameView cuda:1 behavior in CI, not to be merged until the runtime behavior is understood/fixed.

3. Local reproduction

cd <IsaacLab worktree>   # must be on a checkout that has PR #5514 merged (latest develop)
conda activate env_isaaclab

timeout 90 env \
  ISAACLAB_TEST_MULTI_GPU=1 \
  OMNI_KIT_ACCEPT_EULA=yes \
  ACCEPT_EULA=Y \
  ISAAC_SIM_HEADLESS=1 \
  ./isaaclab.sh -p -m pytest \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_world_pose_roundtrip" \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_no_usd_writeback" \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_scales_roundtrip" \
    -v --tb=short

4. Bug surface

  • File: source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
  • Code path: USDRT SelectPrims / FabricFrameView initialization with non-zero CUDA device indices.
  • Context: Enable mgpu in FrameView #5514 removed the previous cuda:0-only Fabric allowlist because USDRT was expected to support arbitrary CUDA device indices.

5. Test plan

  • Workflow appears in checks on this draft PR.
  • Multi-GPU runner is selected.
  • cuda:1 FabricFrameView tests run with ISAACLAB_TEST_MULTI_GPU=1.

Recreated from #5788.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

This PR enables FabricFrameView multi-GPU stress tests in CI to reproduce and monitor the cuda:1 hanging behavior reported in #5514.

Analysis

Workflow Changes (.github/workflows/test-fabric-multi-gpu.yaml):

  1. Trigger re-enabled — The workflow now triggers on PRs touching the relevant files (fabric_frame_view.py, test_views_xform_prim_fabric.py, and the workflow itself), plus manual dispatch. Previously disabled due to missing multi-GPU runner.

  2. Runner label update — Changed from [self-hosted, linux, x64, gpu, multi-gpu] to [self-hosted, linux, x64, multi-gpu] (removed gpu label).

  3. Timeout increased — 30 → 60 minutes, appropriate for a stress test expected to surface hangs.

  4. Python 3.12 setup — Uses setup-python v5 with explicit pin for reproducibility.

  5. cmake via pip — Clever workaround to avoid the sudo apt-get path in install.py by providing cmake on PATH via pip wheel.

  6. Minimal install--install none skips robomimic/teleop extras that require libEGL/X11 headers the runner lacks. Good for isolating the FabricFrameView tests.

  7. Isaac Sim 6.0.0 — Installed separately since --install none skips it. Correctly pinned to 6.0.0 for Python 3.12 compatibility, with fallback to ${{ vars.ISAACSIM_BASE_VERSION }}.

  8. Environment variables — Added OMNI_KIT_ACCEPT_EULA, ACCEPT_EULA, ISAAC_SIM_HEADLESS=1, and ISAACLAB_TEST_MULTI_GPU=1 to enable the cuda:1 tests.

Observations

  • The PR description and commit comments are thorough — clearly documents the purpose, local repro steps, and bug surface.
  • Marked as STRESS/DO NOT MERGE appropriately.
  • The infrastructure label is correct.

Suggestions

  1. Consider adding a comment in the workflow file noting that the Isaac Sim version pin should be kept in sync with pyproject.toml once the 3.12 baseline is fully resolved upstream.

  2. The multi-gpu runner assumption (≥2 GPUs) is validated in the "Verify multi-GPU availability" step, which is good defensive programming.

No blocking issues. This is a well-structured stress test PR for reproducing CI behavior around FabricFrameView cuda:1 hangs.


Update (6afe94e): Reviewed incremental changes. The new commit narrows the test scope from the full test file to the specific hanging test case (test_fabric_cuda1_world_pose_roundtrip[cuda:1]) and simplifies the explanatory comment. Sensible refinement for focused debugging. No new issues.


Update (25b3f03): Extensive diagnostic instrumentation added across the codebase:

  • Workflow: Added -s --tb=short pytest flags for immediate output and shorter tracebacks during hangs.
  • fabric_frame_view.py: Added _fabric_diag() helper with flush-safe prints throughout set_world_poses(), get_world_poses(), _prepare_for_reuse(), _rebuild_fabric_arrays(), and _initialize_fabric() — traces entry/exit and each major operation.
  • test file: Added faulthandler.dump_traceback_later(120, repeat=True) to auto-dump traceback on hang, plus _diag() calls throughout fixture and test factory.

All changes are debugging instrumentation to identify exactly where cuda:1 hangs occur. No functional changes, appropriate for a STRESS test. No new issues.


Update (582813b): Added source/isaaclab_physx/changelog.d/pv-fabric-mgpu-ci-stress.skip — empty skip file to suppress changelog generation for this stress test branch. Appropriate for a DO NOT MERGE debugging PR. No new issues.

@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update (582813b): Added source/isaaclab_physx/changelog.d/pv-fabric-mgpu-ci-stress.skip — an empty skip file to disable changelog generation for this stress test PR. Appropriate for a DO NOT MERGE debugging branch. No issues.


Update (82f9ab7): Added CI step to reinstall pytest/test harness after Isaac Sim wheel install (addresses dependency resolution issues). Sensible workaround for this stress test branch. No issues.


Update (7e418e2): CI hardening improvements:

  • Removed coverage from reinstall to avoid resolver conflict with isaacsim-kernel's pinned version
  • Added robust GPU count parsing (awk filter + empty check) to handle spurious output
  • Added set -o pipefail and explicit pass-count validation to catch silent test failures

All sensible defensive changes for this stress test branch. No issues.


Update (78f38c0): Added version diagnostics block to test_views_xform_prim_fabric.py:

  • Prints Kit version, kernel version, git hash at test startup
  • Logs enabled extension versions for omni.fabric.core, omni.usdrt.core, omni.usdrt.scenegraph, and omni.physx
  • Cleans up temporary variables after printing

This is helpful debugging instrumentation for investigating multi-GPU CI issues. No code concerns — purely additive diagnostics that won't affect test behavior. ✅


Update (0699d8f): Major CI infrastructure change — switched from pip-installed Isaac Sim to Docker-based approach:

  • Added config job to load Isaac Sim image name/tag from config.yaml
  • Added build-base job using ecr-build-push-pull action for Docker image
  • Replaced manual pip install isaacsim with containerized test execution via .github/actions/run-tests
  • Simplified GPU check to use direct nvidia-smi query instead of Python/torch
  • Added proper test result parsing from JUnit XML reports

This aligns with build.yaml's approach to ensure tests run against the Kit version baked into the Isaac Sim container. Good architectural change for CI consistency. ✅

Horde and others added 4 commits May 28, 2026 09:14
Adds version diagnostics at module load to confirm which Kit kernel,
Fabric, and UsdRT versions are actually running inside the CI container.
The previous workflow installed isaacsim==6.0.0 via pip on bare metal,
which pulled Kit 110.0 regardless of what the container image shipped.
This meant tests were running against a stale Kit version instead of
the latest-develop (Kit 111.0).

Rewrites the workflow to use the same Docker-based pattern as build.yaml:
- Build/pull the base image from ECR (nvcr.io/nvidian/isaac-sim:latest-develop)
- Run pytest inside the container via the run-tests action
- Volume-mount the workspace so the PR's source is tested

This ensures the test environment matches what other CI jobs use and
tests always run against the Kit version in the container.
The previous commit used run-tests directly, but that action expects the
Docker image to already be available locally. Since the build and test
jobs can land on different runners, the image must be pulled from ECR
first.

Switch to using the run-package-tests composite action (same as all
other test jobs in build.yaml), which handles:
1. Pulling the image from ECR via ecr-build-push-pull
2. Running tests inside the container via run-tests
3. Uploading artifacts and checking results
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant