[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI by pv-nvidia · Pull Request #5822 · isaac-sim/IsaacLab

pv-nvidia · 2026-05-28T00:08:23Z

1. Summary

Recreates [STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI #5788 on top of latest develop from pv-nvidia:pv/fabric-mgpu-ci-stress.
Enables the FabricFrameView multi-GPU workflow for PRs and sets ISAACLAB_TEST_MULTI_GPU=1 so the cuda:1 tests added by Enable mgpu in FrameView #5514 run in CI.
Adds the setup needed for the self-hosted multi-GPU runner path: Python 3.12, pip-installed cmake, ./isaaclab.sh --install none, and Isaac Sim from https://pypi.nvidia.com.

2. Status

Draft / stress-test PR. This is intended to reproduce and monitor the FabricFrameView cuda:1 behavior in CI, not to be merged until the runtime behavior is understood/fixed.

3. Local reproduction

cd <IsaacLab worktree>   # must be on a checkout that has PR #5514 merged (latest develop)
conda activate env_isaaclab

timeout 90 env \
  ISAACLAB_TEST_MULTI_GPU=1 \
  OMNI_KIT_ACCEPT_EULA=yes \
  ACCEPT_EULA=Y \
  ISAAC_SIM_HEADLESS=1 \
  ./isaaclab.sh -p -m pytest \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_world_pose_roundtrip" \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_no_usd_writeback" \
    "source/isaaclab_physx/test/sim/test_views_xform_prim_fabric.py::test_fabric_cuda1_scales_roundtrip" \
    -v --tb=short

4. Bug surface

File: source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
Code path: USDRT SelectPrims / FabricFrameView initialization with non-zero CUDA device indices.
Context: Enable mgpu in FrameView #5514 removed the previous cuda:0-only Fabric allowlist because USDRT was expected to support arbitrary CUDA device indices.

5. Test plan

Workflow appears in checks on this draft PR.
Multi-GPU runner is selected.
cuda:1 FabricFrameView tests run with ISAACLAB_TEST_MULTI_GPU=1.

Recreated from #5788.

isaaclab-review-bot

Review Summary

This PR enables FabricFrameView multi-GPU stress tests in CI to reproduce and monitor the cuda:1 hanging behavior reported in #5514.

Analysis

Workflow Changes (.github/workflows/test-fabric-multi-gpu.yaml):

Trigger re-enabled — The workflow now triggers on PRs touching the relevant files (fabric_frame_view.py, test_views_xform_prim_fabric.py, and the workflow itself), plus manual dispatch. Previously disabled due to missing multi-GPU runner.
Runner label update — Changed from [self-hosted, linux, x64, gpu, multi-gpu] to [self-hosted, linux, x64, multi-gpu] (removed gpu label).
Timeout increased — 30 → 60 minutes, appropriate for a stress test expected to surface hangs.
Python 3.12 setup — Uses setup-python v5 with explicit pin for reproducibility.
cmake via pip — Clever workaround to avoid the sudo apt-get path in install.py by providing cmake on PATH via pip wheel.
Minimal install — --install none skips robomimic/teleop extras that require libEGL/X11 headers the runner lacks. Good for isolating the FabricFrameView tests.
Isaac Sim 6.0.0 — Installed separately since --install none skips it. Correctly pinned to 6.0.0 for Python 3.12 compatibility, with fallback to ${{ vars.ISAACSIM_BASE_VERSION }}.
Environment variables — Added OMNI_KIT_ACCEPT_EULA, ACCEPT_EULA, ISAAC_SIM_HEADLESS=1, and ISAACLAB_TEST_MULTI_GPU=1 to enable the cuda:1 tests.

Observations

The PR description and commit comments are thorough — clearly documents the purpose, local repro steps, and bug surface.
Marked as STRESS/DO NOT MERGE appropriately.
The infrastructure label is correct.

Suggestions

Consider adding a comment in the workflow file noting that the Isaac Sim version pin should be kept in sync with pyproject.toml once the 3.12 baseline is fully resolved upstream.
The multi-gpu runner assumption (≥2 GPUs) is validated in the "Verify multi-GPU availability" step, which is good defensive programming.

No blocking issues. This is a well-structured stress test PR for reproducing CI behavior around FabricFrameView cuda:1 hangs.

Update (6afe94e): Reviewed incremental changes. The new commit narrows the test scope from the full test file to the specific hanging test case (test_fabric_cuda1_world_pose_roundtrip[cuda:1]) and simplifies the explanatory comment. Sensible refinement for focused debugging. No new issues.

Update (25b3f03): Extensive diagnostic instrumentation added across the codebase:

Workflow: Added -s --tb=short pytest flags for immediate output and shorter tracebacks during hangs.
fabric_frame_view.py: Added _fabric_diag() helper with flush-safe prints throughout set_world_poses(), get_world_poses(), _prepare_for_reuse(), _rebuild_fabric_arrays(), and _initialize_fabric() — traces entry/exit and each major operation.
test file: Added faulthandler.dump_traceback_later(120, repeat=True) to auto-dump traceback on hang, plus _diag() calls throughout fixture and test factory.

All changes are debugging instrumentation to identify exactly where cuda:1 hangs occur. No functional changes, appropriate for a STRESS test. No new issues.

Update (582813b): Added source/isaaclab_physx/changelog.d/pv-fabric-mgpu-ci-stress.skip — empty skip file to suppress changelog generation for this stress test branch. Appropriate for a DO NOT MERGE debugging PR. No new issues.

isaaclab-review-bot

Update (582813b): Added source/isaaclab_physx/changelog.d/pv-fabric-mgpu-ci-stress.skip — an empty skip file to disable changelog generation for this stress test PR. Appropriate for a DO NOT MERGE debugging branch. No issues.

Update (82f9ab7): Added CI step to reinstall pytest/test harness after Isaac Sim wheel install (addresses dependency resolution issues). Sensible workaround for this stress test branch. No issues.

Update (7e418e2): CI hardening improvements:

Removed coverage from reinstall to avoid resolver conflict with isaacsim-kernel's pinned version
Added robust GPU count parsing (awk filter + empty check) to handle spurious output
Added set -o pipefail and explicit pass-count validation to catch silent test failures

All sensible defensive changes for this stress test branch. No issues.

Update (78f38c0): Added version diagnostics block to test_views_xform_prim_fabric.py:

Prints Kit version, kernel version, git hash at test startup
Logs enabled extension versions for omni.fabric.core, omni.usdrt.core, omni.usdrt.scenegraph, and omni.physx
Cleans up temporary variables after printing

This is helpful debugging instrumentation for investigating multi-GPU CI issues. No code concerns — purely additive diagnostics that won't affect test behavior. ✅

Update (0699d8f): Major CI infrastructure change — switched from pip-installed Isaac Sim to Docker-based approach:

Added config job to load Isaac Sim image name/tag from config.yaml
Added build-base job using ecr-build-push-pull action for Docker image
Replaced manual pip install isaacsim with containerized test execution via .github/actions/run-tests
Simplified GPU check to use direct nvidia-smi query instead of Python/torch
Added proper test result parsing from JUnit XML reports

This aligns with build.yaml's approach to ensure tests run against the Kit version baked into the Isaac Sim container. Good architectural change for CI consistency. ✅

Adds version diagnostics at module load to confirm which Kit kernel, Fabric, and UsdRT versions are actually running inside the CI container.

The previous workflow installed isaacsim==6.0.0 via pip on bare metal, which pulled Kit 110.0 regardless of what the container image shipped. This meant tests were running against a stale Kit version instead of the latest-develop (Kit 111.0). Rewrites the workflow to use the same Docker-based pattern as build.yaml: - Build/pull the base image from ECR (nvcr.io/nvidian/isaac-sim:latest-develop) - Run pytest inside the container via the run-tests action - Volume-mount the workspace so the PR's source is tested This ensures the test environment matches what other CI jobs use and tests always run against the Kit version in the container.

The previous commit used run-tests directly, but that action expects the Docker image to already be available locally. Since the build and test jobs can land on different runners, the image must be pulled from ECR first. Switch to using the run-package-tests composite action (same as all other test jobs in build.yaml), which handles: 1. Pulling the image from ECR via ecr-build-push-pull 2. Running tests inside the container via run-tests 3. Uploading artifacts and checking results

ci: enable FabricFrameView multi-GPU stress test

da90ca3

github-actions Bot added the infrastructure label May 28, 2026

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

Horde added 2 commits May 28, 2026 06:52

ci: focus FabricFrameView multi-GPU test on cuda1 roundtrip

6afe94e

test: log FabricFrameView cuda1 hang path

25b3f03

github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026

chore: skip changelog for FabricFrameView CI stress

582813b

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

Horde and others added 4 commits May 28, 2026 09:14

ci: reinstall pytest after Isaac Sim setup

82f9ab7

ci: fail focused FabricFrameView job without pytest pass

7e418e2

ci: print Kit/Fabric/UsdRT version info in fabric mgpu test

78f38c0

Adds version diagnostics at module load to confirm which Kit kernel, Fabric, and UsdRT versions are actually running inside the CI container.

pv-nvidia mentioned this pull request May 28, 2026

feat: log Kit version info on AppLauncher startup #5844

Open

pv-nvidia mentioned this pull request May 28, 2026

ci: add Docker-based FabricFrameView multi-GPU test workflow #5845

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI#5822

[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI#5822
pv-nvidia wants to merge 9 commits into
isaac-sim:developfrom
pv-nvidia:pv/fabric-mgpu-ci-stress

pv-nvidia commented May 28, 2026

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pv-nvidia commented May 28, 2026

1. Summary

2. Status

3. Local reproduction

4. Bug surface

5. Test plan

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Review Summary

Analysis

Observations

Suggestions

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

isaaclab-review-bot Bot left a comment •

edited

Loading

isaaclab-review-bot Bot left a comment •

edited

Loading