[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context#5881
Conversation
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR fixes build_simulation_context silently dropping the device kwarg when sim_cfg is also provided, which caused warp kernel-launch device mismatches on non-default GPUs. The fix changes device's default from "cuda:0" to None (sentinel) and applies it as an override on sim_cfg.device when explicitly passed. The approach is correct and well-tested.
Design Assessment
Design is sound. Using None as a sentinel to distinguish "caller explicitly chose a device" from "caller did not specify" is the standard Python pattern for this problem. Applying the override after sim_cfg resolution (whether freshly built or passed in) correctly handles both branches with a single if device is not None check. This is minimal, targeted, and preserves backward compatibility for all callers.
Findings
🟡 Warning: In-place mutation of caller-provided sim_cfg — source/isaaclab/isaaclab/sim/simulation_context.py:1001
When a caller passes their own sim_cfg object and a device kwarg, the code mutates the caller's config in-place (sim_cfg.device = device). If a caller reuses the same SimulationCfg instance across multiple calls (e.g., in a test loop with different devices), the object will be permanently modified after the first call. This is visible in the test itself — the second build_simulation_context(sim_cfg=cfg, device="cpu") block mutates cfg.device to "cpu".
In practice, this is unlikely to cause issues because:
- The context manager yields then cleans up, so the mutated config is typically not reused afterward.
- Existing callers in the codebase construct
SimulationCfgfresh or don't reuse after the context.
However, documenting this behavior in the docstring (e.g., "Note: when device is provided alongside sim_cfg, sim_cfg.device is modified in-place") would prevent future surprises. Alternatively, a defensive copy.copy(sim_cfg) before mutation would make the API side-effect-free, but that may be over-engineering for a test helper.
🔵 Suggestion: Consider documenting the mutation side-effect — source/isaaclab/isaaclab/sim/simulation_context.py:971
Adding a brief note in the Args docstring would make the behavior explicit:
device: Device to run the simulation on. When given alongside ``sim_cfg``,
overrides ``sim_cfg.device`` **in-place** so the caller's explicit choice wins.
Test Coverage
- Bug fix: Yes — regression test directly exercises the fixed code path (
build_simulation_context(sim_cfg=cfg, device="cpu")assertssim.cfg.device == "cpu"). - New code: Yes — both the sim_cfg-only path and the device-override path are tested.
- Gaps: None critical. The existing parametrized
test_build_simulation_context_no_cfgtest withdevice=["cuda:0", "cpu"]continues to verify the no-cfg path works correctly with the new sentinel default.
CI Status
Core checks (pre-commit, build wheel, changelog fragments, labeler) are passing. Docker/installation/docs builds are pending — these are infrastructure checks unrelated to this code change.
Verdict
Minor fixes needed
The implementation correctly solves the device-mismatch bug with a clean, minimal approach. The only concern is the undocumented in-place mutation of caller-provided sim_cfg, which is a minor API hygiene issue rather than a correctness problem. The test coverage adequately exercises both code paths and would catch regressions.
Update (c02b0e6): Significant new changes since last review. The PR has expanded well beyond the original build_simulation_context device fix to include broader MGPU infrastructure:
New: isaaclab.app.sim_launcher module
A new 427-line module (source/isaaclab/isaaclab/app/sim_launcher.py) that consolidates simulation launch logic:
scan()walks a config tree once to detect physics backend, renderer, and camera requirementslaunch_simulation()context manager decides Kit vs kitless path, validates backend combos, auto-enables camerasmake_physics_cfg()factory for physics backend selectionadd_launcher_args()delegates toAppLauncher.add_app_launcher_args
The design is clean: single-walk, dataclass output, clear separation of scan vs decision. Good error messages for invalid combinations (OVRTX + Kit, OvPhysX + Kit visualizer).
New: ISAACLAB_PIN_KIT_GPU env var
Appends --/renderer/multiGpu/enabled=False, --/renderer/multiGpu/autoEnable=False, --/renderer/multiGpu/maxGpuCount=1, and --/physics/fabricUseGPUInterop=false to Kit CLI when truthy. Solves multi-GPU CI shard isolation (Stage X already attached / SimulationApp.close hangs). Well-documented with issue references.
Breaking: get_scales() returns ProxyArray
All frame view backends (Newton, OvPhysX, PhysX/Fabric, USD) now return ProxyArray instead of raw wp.array from get_scales(). Consistent with pose getters. Changelog correctly documents the break.
Other notable changes in this push:
- Deferred
pxrimports: Massive refactor movingfrom pxr import ...from module-level to function bodies (behindTYPE_CHECKINGfor type hints). Prevents Kit USD binding registration errors when importing isaaclab beforeAppLauncher. resolve_matching_prims_from_sourceraises by default: Newraise_if_no_matches=Trueparameter eliminates boilerplate null-check patterns across ~15 call sites.get_all_matching_child_primsgainsexpected_num_matches: Consolidates repeated count-then-raise patterns.resample_interval_on_resetonEventTermCfg: New flag to preserve interval timers across resets. Well-tested.reshape_data_to_view_3dtorch support: All three backends now accepttorch.Tensorin addition towp.array.- RTX scene partitioning:
prepare_stageinIsaacRtxRendererauthors per-envomni:scenePartitionattributes. __del__fix: Usessys.is_finalizingas default arg to avoid import errors during finalization.- Kit version logging at startup.
- Task path reorganization:
isaaclab_tasks.manager_based.*toisaaclab_tasks.core.*/isaaclab_tasks.contrib.* - Version bumps: isaaclab 6.2.0, newton 0.14.0, physx 1.1.1, ov 0.4.2, experimental 0.1.0, contrib 0.4.1, mimic 1.3.1
- PhysX tensor import fix:
omni.physics.tensors.apitoomni.physics.tensors(wheel-installed Isaac Sim compat)
Incremental Findings
The original findings from the first review still apply. Additional observations:
- The deferred imports refactor is consistent and correct across all affected modules.
- The
raise_if_no_matchesrefactor is safe; callers that need optional matching explicitly passFalse. sim_launcher.pyrecursive config walk handles cycles viaid()tracking and catchesTypeErrorfromvars()on non-dict objects.
Updated Verdict
Approve — The original device-fix is unchanged and correct. The expanded scope adds well-designed MGPU infrastructure with proper changelog entries, version bumps, test coverage, and backward-compatible API evolution (breaking changes are documented). The PR is large but cohesive around the multi-GPU theme.
Update (86f8ea2): Large incremental push reviewed. Changes include: cartpole camera env consolidation (channel-first observations, frame-stack merge into CartpoleCameraEnv, preset-driven configs), isaaclab_experimental lazy-import refactoring (lazy_export() + .pyi stubs), deprecated ls_parallel removal from Newton solver configs, new NewtonSDFCollisionPropertiesCfg, rsl_rl CNNModel for image-only policies, version bumps across all packages, and numerous test updates.
Previous findings status:
- 🟡 In-place mutation of
sim_cfg: Not addressed (original comment still visible). The improved test split (sim_cfg-only vs device-override) in the new commit implicitly documents the intended behavior through code, which partially mitigates the concern.
New issues: None. The incremental changes are well-structured, consistent with the existing patterns, and do not introduce regressions. The channel-first camera observation refactor is properly propagated to all agent configs (rl_games, skrl, rsl_rl).
Update (799192a): The in-place mutation issue flagged in the original review has been resolved. The author replaced sim_cfg.device = device with sim_cfg = sim_cfg.replace(device=device), which returns a copy rather than mutating the caller's object. This makes the API side-effect-free — a caller-owned sim_cfg is no longer permanently modified when device is passed alongside it. The comment explaining the rationale is clear and accurate. All original findings are now addressed. Approve ✅
Update (7d3db80): The .replace() pattern (approved in previous update) has been reverted back to in-place mutation (sim_cfg.device = device). The two-line comment explaining the copy semantics was also removed. This re-introduces the original 🟡 warning: a caller-owned sim_cfg will be permanently mutated when device is passed alongside it. Whether this is intentional (e.g., maintainer feedback preferring simplicity) or accidental is unclear — please confirm this is the desired behavior.
Per per-PR minimum-needed analysis: - isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have; isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot). Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in AppLauncher; drops the workflow env var. - isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of this diagnostic bundle to test whether isaac-sim#5933 alone is enough for newton test_articulation (which calls build_simulation_context(sim_cfg=, device=) at line 2427, so still needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton test_articulation kitless conversion and the schemas.py _create_fixed_joint_to_world helper. Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit XML path-collision fix in conftest. If green, confirms only 4 PRs are needed for multi-GPU CI green (with test_articulation un-gated).
c02b0e6 to
9571074
Compare
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::
RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
trying to launch on device='cuda:0',
but input array for argument 'env_ids' is on device=cuda:2.
Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
9571074 to
86f8ea2
Compare
Greptile SummaryThis PR fixes
Confidence Score: 3/5The core bug fix is correct, but the implementation mutates the caller's The function now permanently changes source/isaaclab/isaaclab/sim/simulation_context.py — the Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["build_simulation_context(device=..., sim_cfg=...)"] --> B{sim_cfg is None?}
B -- Yes --> C["sim_cfg = SimulationCfg(dt=dt, gravity=gravity)"]
B -- No --> D["use caller-provided sim_cfg"]
C --> E{device is not None?}
D --> E
E -- Yes --> F["sim_cfg.device = device\n⚠️ mutates caller's object"]
E -- No --> G["sim_cfg.device unchanged"]
F --> H["SimulationContext(sim_cfg)"]
G --> H
H --> I["yield sim"]
Reviews (1): Last reviewed commit: "Honor device kwarg over sim_cfg.device i..." | Re-trigger Greptile |
| if device is not None: | ||
| # Honor the explicit device kwarg in both branches: when sim_cfg is | ||
| # freshly built, this picks the device; when sim_cfg is passed in, | ||
| # this overrides its (possibly default) device. Without the override, | ||
| # callers passing both ``sim_cfg=<built-with-default-device>`` and | ||
| # ``device=cuda:N`` silently got sim_cfg's device, causing warp | ||
| # kernel-launch mismatches when test fixtures allocated tensors on | ||
| # the requested device while assets resolved their device from the | ||
| # untouched sim_cfg. | ||
| sim_cfg.device = device |
There was a problem hiding this comment.
Caller's
sim_cfg is mutated in-place. When a caller holds a reference to their SimulationCfg and passes it with an explicit device, sim_cfg.device = device permanently changes the object. If the same cfg instance is reused in a subsequent build_simulation_context call without a device kwarg, it will silently inherit the overridden device rather than the originally configured one. @configclass objects expose a replace() method precisely for this pattern; creating a shallow copy avoids the side effect.
| if device is not None: | |
| # Honor the explicit device kwarg in both branches: when sim_cfg is | |
| # freshly built, this picks the device; when sim_cfg is passed in, | |
| # this overrides its (possibly default) device. Without the override, | |
| # callers passing both ``sim_cfg=<built-with-default-device>`` and | |
| # ``device=cuda:N`` silently got sim_cfg's device, causing warp | |
| # kernel-launch mismatches when test fixtures allocated tensors on | |
| # the requested device while assets resolved their device from the | |
| # untouched sim_cfg. | |
| sim_cfg.device = device | |
| if device is not None: | |
| # Honor the explicit device kwarg in both branches: when sim_cfg is | |
| # freshly built, this picks the device; when sim_cfg is passed in, | |
| # this overrides its (possibly default) device. Without the override, | |
| # callers passing both ``sim_cfg=<built-with-default-device>`` and | |
| # ``device=cuda:N`` silently got sim_cfg's device, causing warp | |
| # kernel-launch mismatches when test fixtures allocated tensors on | |
| # the requested device while assets resolved their device from the | |
| # untouched sim_cfg. | |
| sim_cfg = sim_cfg.replace(device=device) |
There was a problem hiding this comment.
Good catch — fixed in 799192a. Switched to sim_cfg = sim_cfg.replace(device=device) so a caller-owned config is no longer mutated in place and stays reusable across calls.
There was a problem hiding this comment.
Perfect — .replace() is exactly the right pattern here. Clean fix. 👍
There was a problem hiding this comment.
Perfect, using .replace() is exactly the right approach here. The updated comment explaining the rationale is helpful too. Thanks for the quick fix! 👍
Use SimulationCfg.replace(device=...) instead of assigning sim_cfg.device in place, so a caller-owned config passed into build_simulation_context is not permanently mutated and stays reusable across subsequent calls.
|
|
||
| if sim_cfg is None: | ||
| gravity = (0.0, 0.0, -9.81) if gravity_enabled else (0.0, 0.0, 0.0) | ||
| sim_cfg = SimulationCfg(device=device, dt=dt, gravity=gravity) |
There was a problem hiding this comment.
maybe just SimulationCfg(dt=dt, gravity=gravity, device=device)?
There was a problem hiding this comment.
actually do
if device:
sim_cfg.device=device
replace does an extra copy iirc
Revert the device override from sim_cfg.replace(device=device) to a plain in-place assignment. replace() reconstructs the whole config (re-running the configclass post-init tree walk, ~50-76us vs ~0.03us for a single attribute set), and the copy it produced guarded against a reuse hazard that no caller hits: build_simulation_context is a test/benchmark helper, the only call sites that reuse a SimulationCfg object already mutate it in-place themselves before passing it, and SimulationContext mutates the cfg it receives regardless.
|
Follow-up review (commit 7d3db80): Reviewed the new commit that reverts The commit message provides reasonable justification:
Given this is a testing utility and the performance concern is valid, the revert is acceptable. No new issues introduced. ✅ |
Summary
build_simulation_context(device="cuda:N", sim_cfg=...)silently dropped the explicitdevicekwarg when asim_cfgwas passed, falling back tosim_cfg.device(defaultcuda:0).ISAACLAB_SIM_DEVICE=cuda:Nper shard, so tests that passdevice="cuda:N"gotcuda:0instead. Downstream Warp kernels then ran oncuda:0while the rest of the test believed it was oncuda:N:device's defaultNone(sentinel) and apply it as an override aftersim_cfgis resolved, so an explicit kwarg wins whether or not asim_cfgwas supplied.1. The bug
When a caller passed both
sim_cfg=<built with default device>anddevice="cuda:2", the kwarg was thrown away. Code that pulled the active device fromsim_cfg.devicesawcuda:0; Warp arrays allocated against the cfg device landed oncuda:0while torch ops driven by the kwarg ran oncuda:2— the cross-device kernel-launch error above.2. Fix
device=None(default) means "use whatever the cfg already has".device="cuda:N"is honored even when a cfg is also passed.3. Validation
source/isaaclab/test/sim/test_build_simulation_context_{headless,nonheadless}.py::test_build_simulation_context_cfgis updated to assert the new override semantics (explicitdevicewins oversim_cfg.device). On local multi-GPU/MIG hardware,build_simulation_context(sim_cfg=cfg, device="cuda:N")previously hit the kernel-launch assertion; with the fix it runs on the requested device. Consumed by the multi-GPU CI lane (#5823).