Skip to content

[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context#5881

Merged
ooctipus merged 3 commits into
isaac-sim:developfrom
hujc7:jichuanh/fix-build-sim-context-device
Jun 9, 2026
Merged

[MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context#5881
ooctipus merged 3 commits into
isaac-sim:developfrom
hujc7:jichuanh/fix-build-sim-context-device

Conversation

@hujc7

@hujc7 hujc7 commented May 30, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • build_simulation_context(device="cuda:N", sim_cfg=...) silently dropped the explicit device kwarg when a sim_cfg was passed, falling back to sim_cfg.device (default cuda:0).
  • The multi-GPU CI lane sets ISAACLAB_SIM_DEVICE=cuda:N per shard, so tests that pass device="cuda:N" got cuda:0 instead. Downstream Warp kernels then ran on cuda:0 while the rest of the test believed it was on cuda:N:
    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.
    
  • Fix: make device's default None (sentinel) and apply it as an override after sim_cfg is resolved, so an explicit kwarg wins whether or not a sim_cfg was supplied.

1. The bug

def build_simulation_context(..., device: str = "cuda:0", sim_cfg=None, ...):
    if sim_cfg is None:
        sim_cfg = SimulationCfg(device=device, ...)
    # ^^ explicit `device` only used in the no-sim_cfg path; otherwise ignored

When a caller passed both sim_cfg=<built with default device> and device="cuda:2", the kwarg was thrown away. Code that pulled the active device from sim_cfg.device saw cuda:0; Warp arrays allocated against the cfg device landed on cuda:0 while torch ops driven by the kwarg ran on cuda:2 — the cross-device kernel-launch error above.

2. Fix

def build_simulation_context(..., device: str | None = None, sim_cfg=None, ...):
    if sim_cfg is None:
        gravity = (0.0, 0.0, -9.81) if gravity_enabled else (0.0, 0.0, 0.0)
        sim_cfg = SimulationCfg(dt=dt, gravity=gravity)
    if device is not None:
        sim_cfg.device = device   # explicit kwarg wins in both branches

device=None (default) means "use whatever the cfg already has". device="cuda:N" is honored even when a cfg is also passed.

3. Validation

source/isaaclab/test/sim/test_build_simulation_context_{headless,nonheadless}.py::test_build_simulation_context_cfg is updated to assert the new override semantics (explicit device wins over sim_cfg.device). On local multi-GPU/MIG hardware, build_simulation_context(sim_cfg=cfg, device="cuda:N") previously hit the kernel-launch assertion; with the fix it runs on the requested device. Consumed by the multi-GPU CI lane (#5823).

@github-actions github-actions Bot added bug Something isn't working isaac-lab Related to Isaac Lab team labels May 30, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR fixes build_simulation_context silently dropping the device kwarg when sim_cfg is also provided, which caused warp kernel-launch device mismatches on non-default GPUs. The fix changes device's default from "cuda:0" to None (sentinel) and applies it as an override on sim_cfg.device when explicitly passed. The approach is correct and well-tested.

Design Assessment

Design is sound. Using None as a sentinel to distinguish "caller explicitly chose a device" from "caller did not specify" is the standard Python pattern for this problem. Applying the override after sim_cfg resolution (whether freshly built or passed in) correctly handles both branches with a single if device is not None check. This is minimal, targeted, and preserves backward compatibility for all callers.

Findings

🟡 Warning: In-place mutation of caller-provided sim_cfgsource/isaaclab/isaaclab/sim/simulation_context.py:1001

When a caller passes their own sim_cfg object and a device kwarg, the code mutates the caller's config in-place (sim_cfg.device = device). If a caller reuses the same SimulationCfg instance across multiple calls (e.g., in a test loop with different devices), the object will be permanently modified after the first call. This is visible in the test itself — the second build_simulation_context(sim_cfg=cfg, device="cpu") block mutates cfg.device to "cpu".

In practice, this is unlikely to cause issues because:

  1. The context manager yields then cleans up, so the mutated config is typically not reused afterward.
  2. Existing callers in the codebase construct SimulationCfg fresh or don't reuse after the context.

However, documenting this behavior in the docstring (e.g., "Note: when device is provided alongside sim_cfg, sim_cfg.device is modified in-place") would prevent future surprises. Alternatively, a defensive copy.copy(sim_cfg) before mutation would make the API side-effect-free, but that may be over-engineering for a test helper.

🔵 Suggestion: Consider documenting the mutation side-effectsource/isaaclab/isaaclab/sim/simulation_context.py:971

Adding a brief note in the Args docstring would make the behavior explicit:

device: Device to run the simulation on. When given alongside ``sim_cfg``,
    overrides ``sim_cfg.device`` **in-place** so the caller's explicit choice wins.

Test Coverage

  • Bug fix: Yes — regression test directly exercises the fixed code path (build_simulation_context(sim_cfg=cfg, device="cpu") asserts sim.cfg.device == "cpu").
  • New code: Yes — both the sim_cfg-only path and the device-override path are tested.
  • Gaps: None critical. The existing parametrized test_build_simulation_context_no_cfg test with device=["cuda:0", "cpu"] continues to verify the no-cfg path works correctly with the new sentinel default.

CI Status

Core checks (pre-commit, build wheel, changelog fragments, labeler) are passing. Docker/installation/docs builds are pending — these are infrastructure checks unrelated to this code change.

Verdict

Minor fixes needed

The implementation correctly solves the device-mismatch bug with a clean, minimal approach. The only concern is the undocumented in-place mutation of caller-provided sim_cfg, which is a minor API hygiene issue rather than a correctness problem. The test coverage adequately exercises both code paths and would catch regressions.


Update (c02b0e6): Significant new changes since last review. The PR has expanded well beyond the original build_simulation_context device fix to include broader MGPU infrastructure:

New: isaaclab.app.sim_launcher module

A new 427-line module (source/isaaclab/isaaclab/app/sim_launcher.py) that consolidates simulation launch logic:

  • scan() walks a config tree once to detect physics backend, renderer, and camera requirements
  • launch_simulation() context manager decides Kit vs kitless path, validates backend combos, auto-enables cameras
  • make_physics_cfg() factory for physics backend selection
  • add_launcher_args() delegates to AppLauncher.add_app_launcher_args

The design is clean: single-walk, dataclass output, clear separation of scan vs decision. Good error messages for invalid combinations (OVRTX + Kit, OvPhysX + Kit visualizer).

New: ISAACLAB_PIN_KIT_GPU env var

Appends --/renderer/multiGpu/enabled=False, --/renderer/multiGpu/autoEnable=False, --/renderer/multiGpu/maxGpuCount=1, and --/physics/fabricUseGPUInterop=false to Kit CLI when truthy. Solves multi-GPU CI shard isolation (Stage X already attached / SimulationApp.close hangs). Well-documented with issue references.

Breaking: get_scales() returns ProxyArray

All frame view backends (Newton, OvPhysX, PhysX/Fabric, USD) now return ProxyArray instead of raw wp.array from get_scales(). Consistent with pose getters. Changelog correctly documents the break.

Other notable changes in this push:

  • Deferred pxr imports: Massive refactor moving from pxr import ... from module-level to function bodies (behind TYPE_CHECKING for type hints). Prevents Kit USD binding registration errors when importing isaaclab before AppLauncher.
  • resolve_matching_prims_from_source raises by default: New raise_if_no_matches=True parameter eliminates boilerplate null-check patterns across ~15 call sites.
  • get_all_matching_child_prims gains expected_num_matches: Consolidates repeated count-then-raise patterns.
  • resample_interval_on_reset on EventTermCfg: New flag to preserve interval timers across resets. Well-tested.
  • reshape_data_to_view_3d torch support: All three backends now accept torch.Tensor in addition to wp.array.
  • RTX scene partitioning: prepare_stage in IsaacRtxRenderer authors per-env omni:scenePartition attributes.
  • __del__ fix: Uses sys.is_finalizing as default arg to avoid import errors during finalization.
  • Kit version logging at startup.
  • Task path reorganization: isaaclab_tasks.manager_based.* to isaaclab_tasks.core.* / isaaclab_tasks.contrib.*
  • Version bumps: isaaclab 6.2.0, newton 0.14.0, physx 1.1.1, ov 0.4.2, experimental 0.1.0, contrib 0.4.1, mimic 1.3.1
  • PhysX tensor import fix: omni.physics.tensors.api to omni.physics.tensors (wheel-installed Isaac Sim compat)

Incremental Findings

The original findings from the first review still apply. Additional observations:

  • The deferred imports refactor is consistent and correct across all affected modules.
  • The raise_if_no_matches refactor is safe; callers that need optional matching explicitly pass False.
  • sim_launcher.py recursive config walk handles cycles via id() tracking and catches TypeError from vars() on non-dict objects.

Updated Verdict

Approve — The original device-fix is unchanged and correct. The expanded scope adds well-designed MGPU infrastructure with proper changelog entries, version bumps, test coverage, and backward-compatible API evolution (breaking changes are documented). The PR is large but cohesive around the multi-GPU theme.


Update (86f8ea2): Large incremental push reviewed. Changes include: cartpole camera env consolidation (channel-first observations, frame-stack merge into CartpoleCameraEnv, preset-driven configs), isaaclab_experimental lazy-import refactoring (lazy_export() + .pyi stubs), deprecated ls_parallel removal from Newton solver configs, new NewtonSDFCollisionPropertiesCfg, rsl_rl CNNModel for image-only policies, version bumps across all packages, and numerous test updates.

Previous findings status:

  • 🟡 In-place mutation of sim_cfg: Not addressed (original comment still visible). The improved test split (sim_cfg-only vs device-override) in the new commit implicitly documents the intended behavior through code, which partially mitigates the concern.

New issues: None. The incremental changes are well-structured, consistent with the existing patterns, and do not introduce regressions. The channel-first camera observation refactor is properly propagated to all agent configs (rl_games, skrl, rsl_rl).


Update (799192a): The in-place mutation issue flagged in the original review has been resolved. The author replaced sim_cfg.device = device with sim_cfg = sim_cfg.replace(device=device), which returns a copy rather than mutating the caller's object. This makes the API side-effect-free — a caller-owned sim_cfg is no longer permanently modified when device is passed alongside it. The comment explaining the rationale is clear and accurate. All original findings are now addressed. Approve


Update (7d3db80): The .replace() pattern (approved in previous update) has been reverted back to in-place mutation (sim_cfg.device = device). The two-line comment explaining the copy semantics was also removed. This re-introduces the original 🟡 warning: a caller-owned sim_cfg will be permanently mutated when device is passed alongside it. Whether this is intentional (e.g., maintainer feedback preferring simplicity) or accidental is unclear — please confirm this is the desired behavior.

@hujc7 hujc7 changed the title [Sim] Honor device kwarg over sim_cfg in build_simulation_context [MGPU] Sim: honor device kwarg over sim_cfg.device in build_simulation_context Jun 2, 2026
hujc7 added a commit to hujc7/IsaacLab that referenced this pull request Jun 3, 2026
Per per-PR minimum-needed analysis:
- isaac-sim#5886 (bounded shutdown) is closed (audit verdict nice-to-have;
  isaac-sim#5933 prevents the hang upstream so the force-exit timer is moot).
  Reverts SIGHUP handler + ISAACLAB_FORCE_EXIT_TIMEOUT timer in
  AppLauncher; drops the workflow env var.
- isaac-sim#5883 (kitless newton) kept open as a separate PR but left out of
  this diagnostic bundle to test whether isaac-sim#5933 alone is enough for
  newton test_articulation (which calls
  build_simulation_context(sim_cfg=, device=) at line 2427, so still
  needs isaac-sim#5881 for the cross-device kwarg fix). Reverts the newton
  test_articulation kitless conversion and the schemas.py
  _create_fixed_joint_to_world helper.

Bundle now contains: isaac-sim#5823 + isaac-sim#5875 base + isaac-sim#5881 + isaac-sim#5933 + the JUnit
XML path-collision fix in conftest. If green, confirms only 4 PRs
are needed for multi-GPU CI green (with test_articulation un-gated).
@hujc7 hujc7 force-pushed the jichuanh/fix-build-sim-context-device branch 2 times, most recently from c02b0e6 to 9571074 Compare June 5, 2026 02:26
Most test callers pass both ``sim_cfg=`` and ``device=`` to
:func:`isaaclab.sim.build_simulation_context`, implicitly expecting the
``device`` kwarg to win. The helper previously dropped the kwarg silently
when ``sim_cfg`` was provided, causing warp kernel-launch device
mismatches on non-default GPUs: the test fixture allocated ``env_ids``
on the requested device while the articulation's ``self.device``
resolved from the untouched ``sim_cfg`` default (``cuda:0``), and
``wp.launch(..., device=self.device)`` failed with::

    RuntimeError: Error launching kernel 'set_root_link_pose_to_sim_index',
    trying to launch on device='cuda:0',
    but input array for argument 'env_ids' is on device=cuda:2.

Change ``device``'s default to ``None`` (sentinel) and apply it as an
override after sim_cfg construction in both branches. The one test that
asserted the old "sim_cfg overrides everything" contract is updated to
cover the new override semantics.
@hujc7 hujc7 force-pushed the jichuanh/fix-build-sim-context-device branch from 9571074 to 86f8ea2 Compare June 5, 2026 08:54
@hujc7 hujc7 marked this pull request as ready for review June 5, 2026 20:42
@hujc7 hujc7 requested a review from ooctipus June 5, 2026 20:42
@greptile-apps

greptile-apps Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes build_simulation_context silently ignoring an explicit device kwarg when a sim_cfg is also passed, which caused cross-device Warp kernel-launch failures on multi-GPU CI shards. The fix changes device default from "cuda:0" to None and applies a post-resolution override in both the fresh-cfg and caller-supplied-cfg paths.

  • Core fix (simulation_context.py): device default becomes None; the override is applied after sim_cfg is resolved, so an explicit kwarg wins in both code paths.
  • Tests (test_build_simulation_context_{headless,nonheadless}.py): two new assertion blocks verify that sim_cfg alone determines all fields when device is omitted, and that an explicit device kwarg wins over sim_cfg.device.
  • Side effect risk: the fix mutates the caller's sim_cfg object in-place; a reused cfg instance will retain the overridden device on subsequent calls without an explicit device arg.

Confidence Score: 3/5

The core bug fix is correct, but the implementation mutates the caller's sim_cfg object in-place, which can corrupt a reused cfg in downstream tests or production code.

The function now permanently changes sim_cfg.device on the object passed in by the caller. Any test or application that builds one SimulationCfg, calls build_simulation_context with it and an explicit device, then reuses the same cfg for a later call without a device kwarg will silently get the overridden device instead of the original. The existing tests don't catch this because the cfg is discarded after the device-override block. @configclass exposes replace() which produces a new copy and avoids the mutation entirely.

source/isaaclab/isaaclab/sim/simulation_context.py — the sim_cfg.device = device assignment on line 1010

Important Files Changed

Filename Overview
source/isaaclab/isaaclab/sim/simulation_context.py Fixes device kwarg being silently ignored when sim_cfg is passed; however, the fix mutates the caller's sim_cfg in-place rather than making a copy, which can corrupt a reused cfg object.
source/isaaclab/test/sim/test_build_simulation_context_headless.py Tests updated to assert the new override semantics; does not catch the in-place mutation because the cfg is not reused after the device-override block.
source/isaaclab/test/sim/test_build_simulation_context_nonheadless.py Mirror of the headless test with the same updated assertions; same gap — mutation of cfg is not exercised after the override block.
source/isaaclab/changelog.d/jichuanh-fix-build-sim-context-device.rst Changelog entry accurately describes the bug and fix.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["build_simulation_context(device=..., sim_cfg=...)"] --> B{sim_cfg is None?}
    B -- Yes --> C["sim_cfg = SimulationCfg(dt=dt, gravity=gravity)"]
    B -- No --> D["use caller-provided sim_cfg"]
    C --> E{device is not None?}
    D --> E
    E -- Yes --> F["sim_cfg.device = device\n⚠️ mutates caller's object"]
    E -- No --> G["sim_cfg.device unchanged"]
    F --> H["SimulationContext(sim_cfg)"]
    G --> H
    H --> I["yield sim"]
Loading

Reviews (1): Last reviewed commit: "Honor device kwarg over sim_cfg.device i..." | Re-trigger Greptile

Comment on lines +1001 to +1010
if device is not None:
# Honor the explicit device kwarg in both branches: when sim_cfg is
# freshly built, this picks the device; when sim_cfg is passed in,
# this overrides its (possibly default) device. Without the override,
# callers passing both ``sim_cfg=<built-with-default-device>`` and
# ``device=cuda:N`` silently got sim_cfg's device, causing warp
# kernel-launch mismatches when test fixtures allocated tensors on
# the requested device while assets resolved their device from the
# untouched sim_cfg.
sim_cfg.device = device

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Caller's sim_cfg is mutated in-place. When a caller holds a reference to their SimulationCfg and passes it with an explicit device, sim_cfg.device = device permanently changes the object. If the same cfg instance is reused in a subsequent build_simulation_context call without a device kwarg, it will silently inherit the overridden device rather than the originally configured one. @configclass objects expose a replace() method precisely for this pattern; creating a shallow copy avoids the side effect.

Suggested change
if device is not None:
# Honor the explicit device kwarg in both branches: when sim_cfg is
# freshly built, this picks the device; when sim_cfg is passed in,
# this overrides its (possibly default) device. Without the override,
# callers passing both ``sim_cfg=<built-with-default-device>`` and
# ``device=cuda:N`` silently got sim_cfg's device, causing warp
# kernel-launch mismatches when test fixtures allocated tensors on
# the requested device while assets resolved their device from the
# untouched sim_cfg.
sim_cfg.device = device
if device is not None:
# Honor the explicit device kwarg in both branches: when sim_cfg is
# freshly built, this picks the device; when sim_cfg is passed in,
# this overrides its (possibly default) device. Without the override,
# callers passing both ``sim_cfg=<built-with-default-device>`` and
# ``device=cuda:N`` silently got sim_cfg's device, causing warp
# kernel-launch mismatches when test fixtures allocated tensors on
# the requested device while assets resolved their device from the
# untouched sim_cfg.
sim_cfg = sim_cfg.replace(device=device)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 799192a. Switched to sim_cfg = sim_cfg.replace(device=device) so a caller-owned config is no longer mutated in place and stays reusable across calls.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect — .replace() is exactly the right pattern here. Clean fix. 👍

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, using .replace() is exactly the right approach here. The updated comment explaining the rationale is helpful too. Thanks for the quick fix! 👍

Use SimulationCfg.replace(device=...) instead of assigning
sim_cfg.device in place, so a caller-owned config passed into
build_simulation_context is not permanently mutated and stays
reusable across subsequent calls.
@hujc7 hujc7 added this to Isaac Lab Jun 5, 2026
@hujc7 hujc7 moved this to In review in Isaac Lab Jun 5, 2026

if sim_cfg is None:
gravity = (0.0, 0.0, -9.81) if gravity_enabled else (0.0, 0.0, 0.0)
sim_cfg = SimulationCfg(device=device, dt=dt, gravity=gravity)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just SimulationCfg(dt=dt, gravity=gravity, device=device)?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually do

if device:
     sim_cfg.device=device

replace does an extra copy iirc

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

Revert the device override from sim_cfg.replace(device=device) to a
plain in-place assignment. replace() reconstructs the whole config
(re-running the configclass post-init tree walk, ~50-76us vs ~0.03us
for a single attribute set), and the copy it produced guarded against
a reuse hazard that no caller hits: build_simulation_context is a
test/benchmark helper, the only call sites that reuse a SimulationCfg
object already mutate it in-place themselves before passing it, and
SimulationContext mutates the cfg it receives regardless.
@isaaclab-review-bot

Copy link
Copy Markdown

Follow-up review (commit 7d3db80):

Reviewed the new commit that reverts .replace(device=device) back to direct in-place assignment.

The commit message provides reasonable justification:

  • .replace() reconstructs the entire config (~50-76μs) vs simple attribute set (~0.03μs)
  • build_simulation_context is a test/benchmark helper, not production API
  • Existing callers that reuse SimulationCfg already mutate it in-place
  • SimulationContext mutates the cfg it receives regardless

Given this is a testing utility and the performance concern is valid, the revert is acceptable. No new issues introduced. ✅

@ooctipus ooctipus merged commit 5189e47 into isaac-sim:develop Jun 9, 2026
37 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in Isaac Lab Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working isaac-lab Related to Isaac Lab team

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants