Skip to content

[DRAFT][Newton][Factory] Prereqs for NutThread support#5835

Draft
mzamoramora-nvidia wants to merge 22 commits into
isaac-sim:developfrom
mzamoramora-nvidia:nut-thread-newton-prereqs
Draft

[DRAFT][Newton][Factory] Prereqs for NutThread support#5835
mzamoramora-nvidia wants to merge 22 commits into
isaac-sim:developfrom
mzamoramora-nvidia:nut-thread-newton-prereqs

Conversation

@mzamoramora-nvidia

@mzamoramora-nvidia mzamoramora-nvidia commented May 28, 2026

Copy link
Copy Markdown

Prereqs for Factory NutThread support on the Newton backend

Description

Aggregates the four IsaacLab changes required to support the Factory NutThread task on the Newton backend onto a single base branch. Each item is independently reviewable and would normally live in its own upstream PR; they are bundled here so the follow-up Factory branch (#5836 — Factory NutThread support on the Newton backend) can rebase against one base instead of juggling four moving targets while the upstream PRs land.

This branch is intentionally draft — once the constituent upstream PRs are merged, this branch will be dropped and the Factory PR will rebase onto develop directly. The bundle exists only to unblock end-to-end Factory training while review proceeds in parallel.

Relationship to the Factory NutThread PR

This PR is the base for #5836 — [Factory] NutThread support on the Newton backend. The commits here are also temporarily carried on the Factory branch so GitHub can render its diff; they belong to this PR's review surface.

Compare view (this PR's diff vs develop):
mzamoramora-nvidia/IsaacLab@develop...nut-thread-newton-prereqs

What this branch contains

Piece Upstream PR Author Status
SDF mesh collision + hydroelastic shape configuration #5228 @AntoineRichard open
NewtonCfg.collision_decimation (mid-tick re-collide) #5729 @mzamoramora-nvidia open
RigidObjectData shape fix for kinematic single-body assets n/a (new) @mzamoramora-nvidia this PR
NewtonManager.sanitize_world_state hook + deferred NaN-reset queue n/a (new) @mzamoramora-nvidia this PR

Topology:

develop
  ├── merge PR #5228 (SDF / hydroelastic cfg)
  ├── merge PR #5729 (collision_decimation)
  ├── Newton: fix RigidObjectData shape for kinematic single-body assets
  └── newton_manager: add sanitize_world_state hook for NaN recovery

The two new commits sit on top of the vendored PRs because both depend on NewtonCollisionPipelineCfg being available in NewtonCfg.

Detail per piece

PR #5228 — SDF / hydroelastic configuration (vendored as-is)

Adds SDFCfg and surfaces the HydroelasticSDFCfg fields needed to drive Newton's pressure-based contact pipeline (moment_matching, buffer_mult_*, grid_size). No local modifications; merged verbatim so the Factory branch can configure sdf_hydroelastic_config=HydroelasticSDFCfg(...) against the same struct shape that will land upstream.

PR #5729NewtonCfg.collision_decimation (vendored as-is)

Adds the opt-in collision_decimation knob that re-invokes the Newton collision pipeline every N solver substeps inside a single physics tick. Default 0 preserves legacy behaviour. The Factory NutThread newton_sdf preset uses collision_decimation=1 (re-collide every substep); the newton (hydroelastic) preset uses collision_decimation=2. These values were chosen empirically — tighter decimation kept the SDF preset stable on this task; hydroelastic tolerated a looser setting.

RigidObjectData shape fix for kinematic single-body assets (new)

Fixes :class:~isaaclab_newton.assets.RigidObjectData crashing with IndexError: tuple index out of range when a fixed-base rigid object holds a single body. The is_fixed_base branch in _create_simulation_bindings indexed [:, 0, 0] assuming a 3D (count, links, 1) layout, but Newton returns a 2D (count, links) array when the view contains a single body. Dispatch on actual ndim so both multi-link fixed-base articulations and single-body kinematic rigid objects work. Also widens _sim_bind_body_com_vel_w to (num_instances, 1) so the no-velocity fallback matches the derive_body_acceleration_from_body_com_velocities kernel's 2D signature.

This is hit immediately when spawning the Factory bolt as a kinematic single-body RigidObjectCfg (the natural Newton-side representation since the bolt is fixed in the world); other tasks that use the same kinematic single-body pattern would hit it too.

NewtonManager.sanitize_world_state hook + deferred NaN-reset queue (new)

Adds two classmethods to NewtonManager:

  • sanitize_world_state(env_ids) — zeroes per-world MJWarp solver internals (qacc_warmstart, qfrc_applied, xfrc_applied, qacc, cacc, cfrc_int, cfrc_ext) and Newton State velocity buffers for the given worlds. Discussed in newton-physics/newton#1266. The current list is intentionally defensive — we clear every MJWarp-internal buffer that could plausibly carry NaN state forward, and stopped tightening once recovery was reliable. Not all of these probably need to be reset; a follow-up should bisect which subset is actually load-bearing.
  • flag_nan_envs(mask) / sanitize_pending_nan_envs() — a tiny classvar-backed pending-reset queue. Tasks call flag_nan_envs when they detect NaN (e.g. in _get_observations), and the env's reset path drains the queue via sanitize_pending_nan_envs once per reset. This defers the scrub to the natural episode boundary instead of clobbering live warm-starts mid-rollout — earlier eager-sanitize attempts regressed hydroelastic training, and deferring the scrub fixed the regression.

The buffers scrubbed are MJWarp-specific; the hook has only been exercised against SolverMuJoCo so far. Other Newton solvers may need a different set of buffers, or none at all — that's an open question for when a non-MJWarp task first needs NaN recovery.

Why bundle this with the prereqs rather than upstream-first: the API surface (two classmethods + one classvar) is small but the placement (which buffers to scrub, where to call the hook) is task-driven; the Factory NutThread env is the first task that actually exercises it. Once the contract is validated by landing the Factory PR, splitting this out into a standalone upstream PR is straightforward — it doesn't depend on anything else here.

Type of change

  • New feature (non-breaking change which adds functionality) — net-new hooks and configclasses, no removal or signature change to existing public API.

Test plan

Check Result
./isaaclab.sh -p -m pytest source/isaaclab_newton/test/physics/test_newton_manager_abstraction.py 39/39 pass (PR #5729 tests, unchanged)
./isaaclab.sh -p -m pytest source/isaaclab_newton/test/physics/test_sdf_config.py 14/14 pass (PR #5228 tests, unchanged)
./isaaclab.sh -f clean (ruff, ruff-format, codespell, license, rst lint)
Factory smoke (drives the new code): Isaac-Factory-NutThread-Direct-v0 presets=newton reaches 100% success at epoch 71 in a 200-epoch run on top of the dependent Factory PR

Standalone regression tests for the two new commits — single-body kinematic RigidObjectCfg round-trip and sanitize_world_state NaN-recovery — are open follow-ups to keep this PR scoped to vendoring + the minimal API additions the Factory branch depends on.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation (Sphinx docstrings on the new hook + classvar)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works — vendored PRs ship with theirs; standalone tests for the two new commits are open follow-ups (see Test plan)
  • I have added a changelog fragment under source/<pkg>/changelog.d/ for every touched package
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

ooctipus and others added 22 commits April 9, 2026 18:06
Add NewtonCollisionPipelineCfg and HydroelasticSDFCfg to expose
Newton's collision pipeline parameters (broad phase, contact limits,
hydroelastic mode) through NewtonCfg.collision_cfg.

Add MJWarpSolverCfg.tolerance for solver convergence control.

Fix validation order so collision_cfg is stored before the
use_mujoco_contacts consistency check runs.  Reset _collision_cfg
in clear() to avoid stale state across reset cycles.  Fall back to
a default CollisionPipeline when collision_cfg is None.
Move config resolution out of NewtonManager into
NewtonCollisionPipelineCfg.to_pipeline_args(), following the Kamino
to_solver_config() pattern. Fix truthiness check on hydroelastic config
dict to use explicit `is not None`. Add missing changelog entry for
MJWarpSolverCfg.tolerance. Use specific return type hint dict[str, Any].
Add SDFCfg @configclass for SDF mesh collision configuration with fields
for resolution, narrow band, margin, body/shape pattern matching,
per-pattern resolution overrides, visual mesh fallback, and hydroelastic
stiffness assignment.

Add five missing fields to HydroelasticSDFCfg: moment_matching,
buffer_mult_broad, buffer_mult_iso, buffer_mult_contact, and grid_size.
Add sdf_cfg field (SDFCfg | None) to NewtonCfg and re-export SDFCfg
from the physics package __init__.pyi so callers can configure SDF
mesh collision at the top-level manager config.
Add _build_sdf_on_mesh, _create_sdf_collision_from_visual, and
_apply_sdf_config classmethods to NewtonManager. These methods apply
SDFCfg settings to matching mesh shapes in the model builder, including
per-pattern resolution overrides and optional hydroelastic flags.
When SDF patterns are configured, exclude matching mesh shapes from
convex-hull approximation so their triangle geometry is preserved for
mesh.build_sdf(). Also call NewtonManager._apply_sdf_config() on each
prototype before add_builder copies it N times, so SDF is built once
and all environments inherit it.
Unit tests for NewtonManager._build_sdf_on_mesh and
_apply_sdf_config using unittest.mock to avoid needing a running
Newton simulation. Covers resolution overrides, hydroelastic flags,
body/shape pattern matching, and non-mesh shape exclusion.
Fix max_resolution/target_voxel_size docstrings to accurately describe
that both are forwarded to Newton. Remove unreachable collision_cfg
force-override block in initialize_solver.
- Pass hydro_patterns to _create_sdf_collision_from_visual so it
  respects hydroelastic_shape_patterns filter (was unconditionally
  setting HYDROELASTIC)
- Replace getattr(cfg, "sdf_cfg", None) with direct attribute access
  in all 3 locations
- Add regex validation with actionable ValueError on invalid patterns
- Log warning and return False from _build_sdf_on_mesh when mesh is
  None; only count successful SDF builds in num_patched
- Fix :attr: -> :class: for SDFCfg reference in docstring
## Summary

- Adds ``NewtonCfg.collision_decimation`` (default ``0`` = legacy
  one-collide-per-tick behaviour).
- When positive, ``_run_solver_substeps`` re-invokes the collision
  pipeline every ``collision_decimation`` substeps so contact normals
  reflect the bodies' just-integrated poses instead of the poses from
  the top of the tick.
- The last substep is intentionally skipped — its contact set would
  only affect the next tick, which the top-of-tick ``collide()`` in
  ``_simulate_physics_only`` / ``_simulate_full`` owns.
- ``NewtonCfg.__post_init__`` warns when ``collision_decimation >=
  num_substeps`` (silently equivalent to ``0``).

## Test plan

- [x] ``./isaaclab.sh -p -m pytest source/isaaclab_newton/test/physics/test_newton_manager_abstraction.py`` — 39/39 pass
- [x] ``./isaaclab.sh -f`` — pre-commit clean
## Summary

- Adds ``test_newton_cfg_collision_decimation_warning`` (parametrized
  over ``(num_substeps, collision_decimation)``) for the
  ``__post_init__`` warning gate.
- Adds ``test_collision_decimation_invokes_mid_loop_collide`` that wraps
  ``NewtonManager._collision_pipeline.collide`` with a counter, runs one
  physics tick with a sphere falling onto a ground plane, and asserts
  the call count equals ``1 + floor((num_substeps - 1) /
  collision_decimation)``.
- Adds ``source/isaaclab_newton/changelog.d/mzamoramora-collision-decimation.minor.rst``.

## Test plan

- [x] ``./isaaclab.sh -p -m pytest source/isaaclab_newton/test/physics/test_newton_manager_abstraction.py`` — 39/39 pass (28 existing + 11 new)
- [x] ``./isaaclab.sh -f`` — pre-commit clean
…nfiguration

Merged Antoine's antoiner/newton-sdf-config branch with conflict resolution:

- extension.toml / CHANGELOG.rst: kept HEAD (develop has advanced past
  PR's manual version bumps; changelog.d conventions supersede manual edits).
- __init__.pyi: union of exports — kept develop's manager classes + added
  SDFCfg.
- newton_collision_cfg.py: union — kept develop's import/doc updates,
  added back PR's 5 hydroelastic fields and the SDFCfg class.
- newton_replicate.py: union — kept develop's deformable-prim handling
  hooks, layered PR's SDF-pattern skip logic and _apply_sdf_config call.
- newton_manager.py: kept develop's structure (modernized API),
  surgically grafted PR's three SDF helper methods (_build_sdf_on_mesh,
  _create_sdf_collision_from_visual, _apply_sdf_config) and 're' import.
- newton_manager_cfg.py: kept develop's structure, added sdf_cfg field on
  NewtonCfg and SDFCfg import.

Not carried over: PR's initialize_solver SDF-detection branch and
_initialize_contacts hydroelastic warning (deferred — develop's
initialize_solver has been refactored to delegate to _build_solver).

This vendoring is for local use on the bundle branch. When PR isaac-sim#5228
lands upstream, this merge commit is reverted and develop is pulled in.
``RigidObjectData._create_simulation_bindings`` crashed with
``IndexError: tuple index out of range`` whenever the rigid object
was fixed-base AND its root view only contained a single body —
e.g. a kinematic-enabled rigid object built from a USD that
authors a ``PhysicsFixedJoint``. Newton's ``get_root_transforms``
returns a 2D ``(count, links)`` array in that scenario, but the
legacy ``is_fixed_base`` branch unconditionally indexed
``[:, 0, 0]`` assuming a 3D ``(count, links, 1)`` layout.
Dispatch on actual ``ndim`` so both fixed-base multi-link
articulations and single-body kinematic rigid objects are handled.

``_create_buffers`` similarly allocated the no-velocity fallback
for ``_sim_bind_body_com_vel_w`` as 1D, which the
``derive_body_acceleration_from_body_com_velocities`` kernel
rejected at launch time (expects 2D). Allocate it as
``(num_instances, 1)`` to match the kernel signature.
Add classmethod `NewtonManager.sanitize_world_state(env_ids)` that
zeroes per-world MJWarp solver scratch (qacc_warmstart, qfrc_*, cacc,
cfrc_*) and Newton State velocity/force buffers (joint_qd, body_qd,
body_f, body_qdd, body_parent_f) for the given worlds, then runs
eval_fk to re-derive body_q.

Generalizes the in-tree Factory NaN scrubber so any task on Newton can
recover NaN-divergent worlds in place via:

    if has_nan_envs:
        NewtonManager.sanitize_world_state(nan_env_ids)

Compared to mjwarp.reset_data: closes the gap on derived qfrc_* / cacc
/ cfrc_int / cfrc_ext that reset_data doesn't touch. cfrc_int in
particular is read-modify-written via wp.atomic_add in
mujoco_warp/_src/smooth.py:_cfrc_backward, so a stale NaN there
survives the next rne() call and re-contaminates qfrc_bias.

No-op for solvers without mjw_data (XPBD, Featherstone, Kamino) since
those don't exhibit the MuJoCo warm-start contamination pattern.
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR bundles four prerequisite changes for Factory NutThread support on the Newton backend. The implementation is well-structured overall with proper configuration separation and comprehensive test coverage for the SDF configuration logic.

✅ Strengths

  • Clean separation between configuration classes and implementation
  • Proper handling of mid-tick re-collide skip on last substep
  • Comprehensive unit tests for SDF configuration (14 tests)
  • Good docstrings explaining the rationale for defensive buffer zeroing in sanitize_world_state

🔍 Findings

1. [Medium] Potential device mismatch in flag_nan_envs (newton_manager.py, ~line 1575)

When _nan_env_mask_pending_reset is first created via torch.zeros_like(mask), it inherits the device from the input mask. If a subsequent call passes a mask on a different device, the |= operation will fail.

Suggested fix:

if cls._nan_env_mask_pending_reset is None:
    cls._nan_env_mask_pending_reset = torch.zeros_like(mask)
elif cls._nan_env_mask_pending_reset.device != mask.device:
    mask = mask.to(cls._nan_env_mask_pending_reset.device)
cls._nan_env_mask_pending_reset |= mask

Alternatively, document that all masks must be on the same device as the first call.


2. [Medium] Defensive assertion for unexpected ndim (rigid_object_data.py, ~line 815)

The ndim >= 3 check correctly handles the 2D vs 3D dispatch, but an ndim of 1 or 4+ would silently take the "else" branch. Making the assumption explicit would fail fast on unexpected Newton API changes:

root_xforms = self._root_view.get_root_transforms(SimulationManager.get_state_0())
if root_xforms.ndim not in (2, 3):
    raise ValueError(f"Unexpected root_xforms ndim={root_xforms.ndim}, expected 2 or 3")
self._sim_bind_root_link_pose_w = root_xforms[:, 0, 0] if root_xforms.ndim == 3 else root_xforms[:, 0]

3. [Low/Suggestion] Add trace logging when sanitize_world_state is no-op (newton_manager.py, ~line 1511)

For debugging purposes, it can be helpful to know when sanitize_world_state is a no-op due to a non-MJWarp backend:

if mjw_data is None:
    logger.debug("sanitize_world_state: skipped (no mjw_data, likely non-MJWarp solver)")
    return

4. [Low/Suggestion] Duplicate regex compilation helper (newton_replicate.py, line 18)

The _compile_sdf_patterns function duplicates the same logic in newton_manager.py:_apply_sdf_config. Consider extracting to a shared utility in newton_collision_cfg.py to avoid drift between the two implementations.


📋 Test Coverage Notes

The PR acknowledges that standalone regression tests for the kinematic single-body fix and sanitize_world_state NaN recovery are follow-up items. Consider adding basic tests for the deferred NaN queue (flag_nan_envs/sanitize_pending_nan_envs) to prevent regressions in that coordination logic.

@kellyguo11 kellyguo11 added this to the Isaac Lab 3.0 GA milestone May 28, 2026
@kellyguo11 kellyguo11 moved this to In progress in Isaac Lab May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

4 participants