[STRESS / DO NOT MERGE] Enable FabricFrameView cuda:1 tests in multi-GPU CI #5822
pv-nvidia wants to merge 9 commits
Review Summary
This PR enables FabricFrameView multi-GPU stress tests in CI to reproduce and monitor the cuda:1 hanging behavior reported in #5514.
Analysis
Workflow Changes (.github/workflows/test-fabric-multi-gpu.yaml):
- Trigger re-enabled: The workflow now triggers on PRs touching the relevant files (fabric_frame_view.py, test_views_xform_prim_fabric.py, and the workflow itself), plus manual dispatch. Previously disabled due to a missing multi-GPU runner.
- Runner label update: Changed from [self-hosted, linux, x64, gpu, multi-gpu] to [self-hosted, linux, x64, multi-gpu] (removed the gpu label).
- Timeout increased: 30 → 60 minutes, appropriate for a stress test expected to surface hangs.
- Python 3.12 setup: Uses setup-python v5 with an explicit pin for reproducibility.
- cmake via pip: Clever workaround to avoid the sudo apt-get path in install.py by providing cmake on PATH via a pip wheel.
- Minimal install: --install none skips robomimic/teleop extras that require libEGL/X11 headers the runner lacks. Good for isolating the FabricFrameView tests.
- Isaac Sim 6.0.0: Installed separately since --install none skips it. Correctly pinned to 6.0.0 for Python 3.12 compatibility, with fallback to ${{ vars.ISAACSIM_BASE_VERSION }}.
- Environment variables: Added OMNI_KIT_ACCEPT_EULA, ACCEPT_EULA, ISAAC_SIM_HEADLESS=1, and ISAACLAB_TEST_MULTI_GPU=1 to enable the cuda:1 tests.
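As a rough illustration of how an environment flag like ISAACLAB_TEST_MULTI_GPU=1 can gate the cuda:1 cases, here is a minimal sketch. The helper name multi_gpu_devices and its parameters are hypothetical and are not taken from the PR; the real test file may gate its parametrization differently.

```python
import os


def multi_gpu_devices(env=None, gpu_count=1):
    """Return the device strings a test matrix should cover.

    Hypothetical helper: cuda:1 is only added when the opt-in flag is set
    AND at least two GPUs are reported, so single-GPU runs stay unaffected.
    """
    env = os.environ if env is None else env
    devices = ["cuda:0"]
    if env.get("ISAACLAB_TEST_MULTI_GPU") == "1" and gpu_count >= 2:
        devices.append("cuda:1")
    return devices
```

Passing the environment and GPU count as arguments keeps the gate testable without real hardware.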
Observations
- The PR description and commit comments are thorough: they clearly document the purpose, local repro steps, and bug surface.
- Marked as STRESS/DO NOT MERGE appropriately.
- The infrastructure label is correct.
Suggestions
- Consider adding a comment in the workflow file noting that the Isaac Sim version pin should be kept in sync with pyproject.toml once the 3.12 baseline is fully resolved upstream.
- The multi-gpu runner assumption (≥2 GPUs) is validated in the "Verify multi-GPU availability" step, which is good defensive programming.
No blocking issues. This is a well-structured stress test PR for reproducing CI behavior around FabricFrameView cuda:1 hangs.
Update (6afe94e): Reviewed incremental changes. The new commit narrows the test scope from the full test file to the specific hanging test case (test_fabric_cuda1_world_pose_roundtrip[cuda:1]) and simplifies the explanatory comment. Sensible refinement for focused debugging. No new issues.
Update (25b3f03): Extensive diagnostic instrumentation added across the codebase:
- Workflow: Added -s --tb=short pytest flags for immediate output and shorter tracebacks during hangs.
- fabric_frame_view.py: Added a _fabric_diag() helper with flush-safe prints throughout set_world_poses(), get_world_poses(), _prepare_for_reuse(), _rebuild_fabric_arrays(), and _initialize_fabric(); it traces entry/exit and each major operation.
- Test file: Added faulthandler.dump_traceback_later(120, repeat=True) to auto-dump a traceback on hang, plus _diag() calls throughout the fixture and test factory.
All changes are debugging instrumentation to identify exactly where cuda:1 hangs occur. No functional changes, appropriate for a STRESS test. No new issues.
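The instrumentation pattern described in this update can be sketched roughly as follows. This is not the PR's code: diag, arm_hang_watchdog, and disarm_hang_watchdog are hypothetical names, and only the standard-library faulthandler calls are real APIs.

```python
import faulthandler
import sys
import time


def diag(msg, stream=None):
    """Print a timestamped diagnostic line and flush immediately.

    Flushing matters during a hang: buffered output may never reach the
    CI log if the process is killed by the job timeout.
    """
    stream = sys.stderr if stream is None else stream
    line = f"[diag {time.monotonic():.3f}] {msg}"
    print(line, file=stream, flush=True)
    return line


def arm_hang_watchdog(timeout_s=120):
    """Dump all thread tracebacks every `timeout_s` seconds until cancelled."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True)


def disarm_hang_watchdog():
    """Cancel the pending dump; safe to call even if none is armed."""
    faulthandler.cancel_dump_traceback_later()
```

A test module would typically arm the watchdog once at import time and call diag() on entry and exit of each suspect operation, so the last printed line brackets where the hang occurred.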
Update (582813b): Added source/isaaclab_physx/changelog.d/pv-fabric-mgpu-ci-stress.skip — empty skip file to suppress changelog generation for this stress test branch. Appropriate for a DO NOT MERGE debugging PR. No new issues.
Update (82f9ab7): Added CI step to reinstall pytest/test harness after Isaac Sim wheel install (addresses dependency resolution issues). Sensible workaround for this stress test branch. No issues.
Update (7e418e2): CI hardening improvements:
- Removed coverage from the reinstall to avoid a resolver conflict with isaacsim-kernel's pinned version
- Added robust GPU count parsing (awk filter + empty check) to handle spurious output
- Added set -o pipefail and explicit pass-count validation to catch silent test failures
All sensible defensive changes for this stress test branch. No issues.
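The "filter, then check for empty" idea behind the GPU count parsing can be illustrated in Python. This is a sketch only: the workflow itself uses awk in a shell step, and the input format assumed here is the one-line-per-GPU listing that nvidia-smi -L produces; count_gpus and require_multi_gpu are hypothetical names.

```python
import re

# Assumed line shape: "GPU 0: <name> (UUID: ...)", one line per device.
_GPU_LINE = re.compile(r"^GPU \d+:")


def count_gpus(listing: str) -> int:
    """Count GPU lines, ignoring warnings or other spurious output."""
    matches = [ln for ln in listing.splitlines() if _GPU_LINE.match(ln.strip())]
    if not matches:
        # Empty check: fail loudly instead of treating "no output" as zero.
        raise ValueError("no GPU lines found in listing")
    return len(matches)


def require_multi_gpu(listing: str, minimum: int = 2) -> int:
    """Return the GPU count, raising if fewer than `minimum` are present."""
    count = count_gpus(listing)
    if count < minimum:
        raise RuntimeError(f"need at least {minimum} GPUs, found {count}")
    return count
```

Raising on empty input is the important part: a silent zero or an unparsed string would otherwise let the job proceed on a runner that cannot exercise cuda:1.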
Update (78f38c0): Added version diagnostics block to test_views_xform_prim_fabric.py:
- Prints Kit version, kernel version, and git hash at test startup
- Logs enabled extension versions for omni.fabric.core, omni.usdrt.core, omni.usdrt.scenegraph, and omni.physx
- Cleans up temporary variables after printing
This is helpful debugging instrumentation for investigating multi-GPU CI issues. No code concerns — purely additive diagnostics that won't affect test behavior. ✅
Update (0699d8f): Major CI infrastructure change — switched from pip-installed Isaac Sim to Docker-based approach:
- Added a config job to load the Isaac Sim image name/tag from config.yaml
- Added a build-base job using the ecr-build-push-pull action for the Docker image
- Replaced the manual pip install isaacsim with containerized test execution via .github/actions/run-tests
- Simplified the GPU check to use a direct nvidia-smi query instead of Python/torch
- Added proper test result parsing from JUnit XML reports
This aligns with build.yaml's approach to ensure tests run against the Kit version baked into the Isaac Sim container. Good architectural change for CI consistency. ✅
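Result parsing from a JUnit XML report, combined with the explicit pass-count validation mentioned earlier, could look roughly like the sketch below. It assumes the common JUnit layout where each testcase element carries an optional failure, error, or skipped child; summarize_junit and check_min_passed are hypothetical names, not the action's actual implementation.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> dict:
    """Classify every <testcase> by its first failure/error/skipped child."""
    root = ET.fromstring(xml_text)
    counts = {"passed": 0, "failed": 0, "errors": 0, "skipped": 0}
    for case in root.iter("testcase"):
        if case.find("failure") is not None:
            counts["failed"] += 1
        elif case.find("error") is not None:
            counts["errors"] += 1
        elif case.find("skipped") is not None:
            counts["skipped"] += 1
        else:
            counts["passed"] += 1
    return counts


def check_min_passed(counts: dict, minimum: int = 1) -> bool:
    """True only if nothing failed AND at least `minimum` tests passed.

    The pass-count floor catches runs where every test was skipped or
    deselected, which would otherwise look green.
    """
    return (
        counts["failed"] == 0
        and counts["errors"] == 0
        and counts["passed"] >= minimum
    )
```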
Adds version diagnostics at module load to confirm which Kit kernel, Fabric, and UsdRT versions are actually running inside the CI container.
The previous workflow installed isaacsim==6.0.0 via pip on bare metal, which pulled Kit 110.0 regardless of what the container image shipped. This meant tests were running against a stale Kit version instead of the latest-develop (Kit 111.0).
Rewrites the workflow to use the same Docker-based pattern as build.yaml:
- Build/pull the base image from ECR (nvcr.io/nvidian/isaac-sim:latest-develop)
- Run pytest inside the container via the run-tests action
- Volume-mount the workspace so the PR's source is tested
This ensures the test environment matches what other CI jobs use and tests always run against the Kit version in the container.
The previous commit used run-tests directly, but that action expects the Docker image to already be available locally. Since the build and test jobs can land on different runners, the image must be pulled from ECR first.
Switch to using the run-package-tests composite action (same as all other test jobs in build.yaml), which handles:
1. Pulling the image from ECR via ecr-build-push-pull
2. Running tests inside the container via run-tests
3. Uploading artifacts and checking results
1. Summary
- develop from pv-nvidia:pv/fabric-mgpu-ci-stress.
- ISAACLAB_TEST_MULTI_GPU=1 so the cuda:1 tests added by Enable mgpu in FrameView #5514 run in CI.
- ./isaaclab.sh --install none, and Isaac Sim from https://pypi.nvidia.com.
2. Status
Draft / stress-test PR. This is intended to reproduce and monitor the FabricFrameView cuda:1 behavior in CI, not to be merged until the runtime behavior is understood/fixed.
3. Local reproduction
4. Bug surface
- source/isaaclab_physx/isaaclab_physx/sim/views/fabric_frame_view.py
- SelectPrims / FabricFrameView initialization with non-zero CUDA device indices.
- cuda:0-only Fabric allowlist because USDRT was expected to support arbitrary CUDA device indices.
5. Test plan
- ISAACLAB_TEST_MULTI_GPU=1.
Recreated from #5788.