Merge main into develop #615
## Summary Explain the resume flag
…572) ## Summary This is a doc change requested by QA in https://nvbugspro.nvidia.com/bug/6063011. It clarifies that a Newton-trained model evaluated using PhysX is expected to fail the DexSuite task completely.
## Summary Subprocess-spawning tests hang indefinitely on CI. ## Causes & Fixes ### Problems From Lab: 1. Lab reports "AppLauncher doesn't quit properly after app.close(), app.quit() doesn't help either." 2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines. These issues apply to us because tests hang during the subprocess test section, between the end of one test and the beginning of the next. See detailed logs and analysis from reproducing locally [here](#568) ### Fixes 1. `SimulationApp` Force Exit: Skips `app.close()` (which can hang indefinitely in Kit's shutdown path) when the env var `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. Calls a new `_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all direct children before doing `os._exit(0)`, preventing orphaned Kit processes from holding GPU resources. 2. `run_subprocess` has a configurable wall-clock timeout and process isolation, so that it can trigger the force exit path above when needed. 3. Add wall-clock timing and logging inside the SimulationApp start method to keep track of how long startup takes on CI. ## Minor fixes 1. Add timing stats to the pytest commands so that they report the slowest test functions at the end of each section. 2. Parametrize multi-config tests: Convert nested for-loops in `test_zero_action_policy_kitchen_pick_and_place` (6 configs) and `test_zero_action_policy_gr1_open_microwave` (3 configs) into `@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail, and timing. 3. Reduce num_envs in gr00t eval_runner test to speed up. ### Local validation With the repro script #568, I do not have local stalling. See the log for more details.
[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log) ### CI Before -- timeout <img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00" /> ### CI After <img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d" /> <img width="1198" height="472" alt="image" src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2" /> ## TODOs - test_camera_observations takes 10mins to start the app due to Kit cold start. Experimenting with a warm start before tests process here #565 - Kit itself intermittently deadlocks during startup — not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. Experimenting with retry here #570
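The timeout-plus-isolation idea behind `run_subprocess` can be sketched as follows. This is a minimal illustration, not the repository's helper: the function name `run_with_timeout` and its signature are hypothetical, and it assumes a POSIX system where `start_new_session=True` places the child in its own process group.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> int:
    """Run cmd in its own session and kill its whole process group on timeout.

    Hypothetical helper for illustration only; the real run_subprocess() may differ.
    """
    # A new session makes the child a process-group leader (POSIX only), so
    # anything it spawns can be signalled together with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The group id equals the child's pid because it leads the session.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        raise
```

A caller would typically read the wall-clock limit from the environment and pass it in, for example `run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=900.0)`.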
## Summary Install missing arena package into NGC docker. ## Detailed description - We forgot to install our new package `isaaclab_arena_examples` into the docker image. - This was masked in CI due to mounting a branch and correctly installing there. Co-authored-by: Xinjie Yao <[email protected]>
## Summary - Fix the `eval_config.json` example in the DexSuite Kuka Allegro Lift evaluation docs to match the actual `eval_runner.py` schema (`jobs` array with `name`, `arena_env_args`, `policy_type`, `policy_config_dict`). Signed-off-by: Clemens Volk <[email protected]> Co-authored-by: Xinjie Yao <[email protected]>
## Summary Doc fix to https://nvbugspro.nvidia.com/bug/6062848, Readme updates. ## Detailed description - Policy training docs: Added a "Compute Requirements" section (GPU VRAM + system RAM guidance) to all three workflow tutorials (static_manipulation, sequential_static_manipulation, locomanipulation) and fixed the "an an" typo. - Arena-in-your-repo docs: Created an index.rst landing page for the section and updated docs/index.rst to use it instead of listing the three sub-pages individually. - README: Added a link to the "Installing IsaacLab-Arena in Your Repository" guide in the "Publishing Your Own Benchmark" section.
## Summary As CI seems to run smoothly again, bring back previously disabled tests.
## Problem IsaacLab-Arena needs a tabletop manipulation task where the G1 robot uses the WBC-AGILE locomotion policy to pick up an apple and place it on a plate, while balancing in place. Ref: ISAAC-12630 ## Solution Add a new `G1AgileTabletopAppleToPlateEnvironment` that wires the `G1WBCAgileJointEmbodiment` (from PR #489) with the existing `PickAndPlaceTask`, a Seattle Lab table scene, and appropriate object assets. ## Changes - **`isaaclab_arena_environments/g1_agile_tabletop_apple_to_plate_environment.py`** — New environment class: G1 robot at (-0.4, 0, 0) facing a table with an apple (pick object) and a clay plate (target). Uses `G1WBCAgileJointEmbodiment` for balance + upper body control. 30-second episodes. Supports `--object`, `--embodiment`, `--teleop_device` CLI args. - **`isaaclab_arena_environments/cli.py`** — Register the new environment in the `ExampleEnvironments` dict. - **`isaaclab_arena/tests/test_g1_agile_tabletop_apple_to_plate.py`** — Two tests: (1) initial state is not terminated (apple starts away from plate), (2) teleporting apple onto plate triggers success termination. Uses correct base-height command (0.75) to keep the robot stable. ## Testing - [x] New unit tests added (2 tests) - [x] Linters pass locally (black, flake8, isort, pyupgrade, codespell, license headers) - [ ] CI pipeline (tests require Isaac Sim Docker with GPU) ## Notes - Object positions (apple, plate, robot) are based on Seattle Lab table geometry and G1 arm reach. May need visual tuning in simulator. - No new task class needed — the existing `PickAndPlaceTask` handles contact-sensor success detection, object-dropped termination, and metrics. - Self-review caught and fixed a test issue: the initial-state test was sending zero base-height commands, which would cause the robot to squat. Fixed to use 0.75 (matching established pattern from `test_g1_wbc_embodiment.py`). 
--- *Generated by [autodev](https://github.com/anthropics/claude-code) — Claude Code* --------- Signed-off-by: Lionel Gulich <[email protected]> Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
## Summary Rework the concepts documentation to eliminate AI slop. --------- Signed-off-by: Clemens Volk <[email protected]> Co-authored-by: Clemens Volk <[email protected]> Co-authored-by: isaaclab-review-bot[bot] <270793704+isaaclab-review-bot[bot]@users.noreply.github.com> Co-authored-by: Xinjie Yao <[email protected]>
## Motivation When building tasks, users often need to restrict object placement to a sub-region of a surface -- for example, only within the robot's reachable workspace. `On(table)` allows placement anywhere on the table, and `AtPosition` pins to a single point. There was no way to constrain to a region or set bounds on individual axes. ## Summary - New `PositionLimits` unary relation that constrains object position in world coordinates. Supports full ranges (box), single bounds (half-plane), or a mix per axis. - New `UnaryRelation` base class so `get_spatial_relations()` automatically includes any new unary relation without updating isinstance checks. - `PositionLimitsLossStrategy` using `linear_band_loss` (both bounds) and `single_boundary_linear_loss` (single bound). - Registered in solver strategies with slope=100.0 (matching `AtPosition`). - Fixed `_print_unary_relation_debug` to work with any unary relation type. ## Usage
```python
# Full box constraint (reachable region)
apple.add_relation(On(table))
apple.add_relation(PositionLimits(x_min=-0.3, x_max=0.3, y_min=-0.2, y_max=0.2))

# Single bound (half-plane)
apple.add_relation(PositionLimits(x_min=0.5))

# Mix
apple.add_relation(PositionLimits(x_min=-0.3, x_max=0.3, y_min=-0.2))
```
## Test plan - [x] 12 PositionLimits tests pass (11 strategy-level, 1 solver integration) - [x] All relation/placer tests pass - [x] Pre-commit checks pass --------- Signed-off-by: Clemens Volk <[email protected]> Co-authored-by: Xinjie Yao <[email protected]>
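To make the two loss shapes concrete, here is a rough sketch of a per-axis penalty that is zero inside the allowed range and grows linearly outside it. The function names `axis_loss` and `position_limits_loss` and their signatures are hypothetical stand-ins, not the actual `linear_band_loss` or `single_boundary_linear_loss` implementations; only the slope of 100.0 comes from the description above.

```python
from typing import Optional, Sequence, Tuple

Bounds = Tuple[Optional[float], Optional[float]]


def axis_loss(value: float, lower: Optional[float], upper: Optional[float], slope: float = 100.0) -> float:
    """Linear penalty for leaving [lower, upper]; a bound of None is unconstrained."""
    loss = 0.0
    if lower is not None and value < lower:
        loss += slope * (lower - value)  # below the lower bound
    if upper is not None and value > upper:
        loss += slope * (value - upper)  # above the upper bound
    return loss


def position_limits_loss(position: Sequence[float], limits: Sequence[Bounds], slope: float = 100.0) -> float:
    """Sum the per-axis penalties for a position such as (x, y, z)."""
    return sum(axis_loss(v, lo, hi, slope) for v, (lo, hi) in zip(position, limits))
```

With both bounds set this behaves like a band; with one bound set it reduces to a single-boundary (half-plane) penalty, which mirrors the box, half-plane, and mixed cases in the usage example.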
## Summary Test code owners by only adding myself. ## Detailed description - We had an incident where a couple of bots with organization access collaborated to push an unreviewed change to main. - This is an attempt to prevent this in the future.
## Summary Complete the list of code owners. ## Detailed description - Follows successful test #584
## Summary Fix CODEOWNERS specification. ## Detailed description - With multiple lines, the last matching line overrides the previous ones. - This is not what was intended. - Fix.
## Summary Bring IsaacLab issue templates into Isaac Lab - Arena ## Detailed description - Gives users a structure for bug reports and feature requests.
## Summary This fixes the teleop crash reported in https://nvbugspro.nvidia.com/bug/6066640. The root cause is a regression in the latest isaacteleop 1.2.xxx releases. 1.1.x should be the latest stable version to use, and the Teleop team will push patches to fix the issues on the Teleop side. Teleop on the Arena side was verified to work after rebuilding the Arena docker with this change. Co-authored-by: Xinjie Yao <[email protected]>
## Summary Remove server client from v0.2 release docs ## Detailed description - We plan on reworking the server client to fully support it in v0.3 - The current implementation of the server-client, and its documentation, are only half supported. - Remove the documentation references to the server-client and aim for full support in `v0.3` - Address [6072205](https://nvbugspro.nvidia.com/bug/6072205)
## Summary Clean up type annotations in the environment files ## Detailed description - Type annotations were not properly done at the start of the project, and that propagated over time to all environment files. - This cleans that up.
## Summary The config in the doc was mistakenly set for AVP instead of Quest/Pico hand tracking. This corrects the doc and fixes the issue reported in https://nvbugspro.nvidia.com/bug/6076546
…596) ## Summary CI subprocess tests are slow and hit the timeout even when they do not stall ## Detailed description - Skipped `test_eval_runner_enable_cameras` as cold-start camera rendering takes ~1165s, making CI exceed the timeout. - Replaced raw `subprocess.run()` with the shared `run_subprocess()` helper, which enforces `ISAACLAB_ARENA_SUBPROCESS_TIMEOUT` (900s in CI). - Removed redundant stdout-regex failure check; the eval_runner already exits non-zero on job failure (no `--continue_on_error`).
## Summary Address https://nvbugspro.nvidia.com/bug/6062848 ## Detailed description In a multi-GPU setup, the standard output (stdout) buffer gets flooded with logs from secondary GPUs. As a result, the wandb prompt requesting user input gets buried in the output. Because the prompt goes unanswered, the data loading process stalls, eventually leading to a timeout
## Summary Set GR1 XrCfg to anchor on the robot pelvis, similar to G1. The teleop initial view now aligns with the robot head. This fixes bug https://nvbugspro.nvidia.com/bug/6076070 --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
## Summary Sanitize numpy types in metrics for readable logging and JSON export. Fix to https://nvbugspro.nvidia.com/bug/6077892 ## Detailed description - **Why:** Metric values from environment rollouts can be `np.float32`, `np.int64`, or `np.ndarray`. These print as `np.float32(0.85)` instead of `0.85` and are not JSON-serializable by default. - **What changed:** Added a `sanitize_metrics()` utility in `metrics_logger.py` that converts numpy scalars to `float` and numpy arrays to Python lists. Used it in `policy_runner.py` (metrics print after rollout) and `MetricsLogger.append_job_metrics()` (sanitizes on ingestion so both `print_metrics()` and `save_metrics_to_file()` get clean types). - **Impact:** Metrics are now human-readable when printed and safely serializable to JSON. No behavioral change.
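The conversion described above can be sketched roughly as below. This is not the repository's `sanitize_metrics()`: the name `to_plain_python` is hypothetical, and the sketch relies on the fact that NumPy scalars and arrays both expose `tolist()`, which returns built-in Python values (so an `np.int64` becomes an `int` here rather than a `float`). It also recurses into nested dicts and lists, which is an extension beyond what the description states.

```python
import json
from typing import Any


def to_plain_python(value: Any) -> Any:
    """Recursively convert NumPy-like values to built-in Python types.

    Hypothetical sketch; duck-typed on tolist() so it needs no NumPy import.
    """
    if isinstance(value, dict):
        return {key: to_plain_python(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain_python(item) for item in value]
    tolist = getattr(value, "tolist", None)
    if callable(tolist):
        # NumPy scalars return a Python scalar, arrays return (nested) lists.
        return tolist()
    return value


def metrics_to_json(metrics: dict) -> str:
    """Serialize a metrics dict after sanitizing it."""
    return json.dumps(to_plain_python(metrics))
```

Sanitizing once on ingestion, as the description does in `append_job_metrics()`, means printing and file export both see the same clean values.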
## Summary Teleop bug https://nvbugspro.nvidia.com/bug/6081144 might be related to missing the env sourcing step. Update the doc so that this step is in a separate box and more noticeable.
Reverts #562 Co-authored-by: Alex Millane <[email protected]>
🤖 Isaac Lab Review Bot
Summary
Merge PR syncing main → develop. The changes span five areas: (1) isaacteleop version bump to ~=1.1.0, (2) documentation restructuring of teleop workflows to numbered-step format with an important cloudxr.env ordering note, (3) XR anchor refactor from world-space offset composition to pelvis-relative prim anchoring on GR1T2 (aligning it with G1's existing pattern), (4) a metrics_to_plain_python_types() utility to fix JSON serialization of numpy types, and (5) widespread adoption of TYPE_CHECKING for deferred imports across all environment files.
The changes are well-structured and internally consistent. The deleted common.py has no remaining references, and the base class get_xr_cfg() correctly returns self.xr which is now set directly. Two minor suggestions below.
Design Assessment
Design is sound. The XR anchor refactor is a good simplification — pelvis-relative anchoring removes the need for world-space pose composition and makes the behavior consistent between GR1T2 and G1. The TYPE_CHECKING pattern cleanly resolves the long-standing annotation issue noted in multiple TODO comments.
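For readers unfamiliar with the pattern, a minimal sketch of a `TYPE_CHECKING` deferred import follows. The module and function here are placeholders chosen so the snippet runs with the standard library alone; they are not taken from the environment files.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; this import is skipped at runtime,
    # which avoids import cycles and heavy imports at module load time.
    from fractions import Fraction


def describe(value: "Fraction") -> str:
    """The quoted annotation is never evaluated, so Fraction need not be imported."""
    return f"value={value}"
```

The annotation stays a string at runtime, so the function is fully typed for tooling while the import cost is deferred.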
Findings
(Detailed findings posted as inline comments on the relevant lines.)
🔵 Suggestion: metrics_to_plain_python_types() return type annotation is narrower than actual — see inline comment.
🔵 Suggestion: metrics_to_plain_python_types() doesn't handle nested dicts — noted inline as a minor enhancement.
Test Coverage
- XR refactor: Tests properly updated. The old world-space composition tests are replaced by unified pelvis-relative assertions covering both gr1_pink and g1_wbc_pink. The test correctly verifies that set_initial_pose() does not alter the anchor config (the key behavioral change).
- Metrics utility: No dedicated unit test for metrics_to_plain_python_types(), but the function is straightforward and exercised via append_job_metrics() in integration tests.
- Eval runner: Test simplification from stdout parsing to exit-code checking is an improvement in robustness. The @pytest.mark.skip additions for camera cold-start are pragmatic CI fixes.
CI Status
Pre-commit check is in progress.
Verdict
Ship it — Clean merge with no blocking issues. The two inline suggestions are optional improvements.
## Summary Reduce to a minimum number of policy runner tests. ## Detailed description - These tests are slow and flaky - Reduce to a minimal number to try to speed up CI and decrease the probability it stalls. ## Not done - This reduces our test coverage on our environments (which was already very low) - I will try, in a follow up MR, to add complete coverage of the environments **in process**, so that they're fast to run.
## Summary Address https://nvbugspro.nvidia.com/bug/6084606. Update training doc system requirements and defaults ## Detailed description Across all three workflow docs (locomanipulation, static_manipulation, sequential_static_manipulation): - Increased recommended system RAM from 256 GB to 512 GB - Changed --dataloader_num_workers from 8 to 16 in all training commands - Added a .. note:: explaining that global_batch_size and dataloader_num_workers can be reduced on less powerful hardware at the cost of longer training time
Greptile Summary
This PR merges
Confidence Score: 4/5
Safe to merge pending resolution of the quaternion denormalization P1 in gr1t2.py.
One P1 from a prior review (additive quaternion noise without renormalization in gr1t2.py) remains unaddressed; all other prior concerns are P2 or confirmed false-positive. No new critical issues were found in this merge.
isaaclab_arena/embodiments/gr1t2/gr1t2.py (quaternion noise renormalization)
Important Files Changed
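The P1 concerns a general property of quaternions rather than anything specific to this repository: adding noise component-wise leaves a quaternion that is no longer unit length, so it has to be renormalized before it is used as a rotation. A minimal sketch, assuming a four-component quaternion and Gaussian noise (the function name and signature are hypothetical and this is not the code in gr1t2.py):

```python
import math
import random
from typing import Sequence, Tuple


def perturb_quaternion(quat: Sequence[float], noise_std: float, rng: random.Random) -> Tuple[float, ...]:
    """Add Gaussian noise to each component, then renormalize to unit length."""
    noisy = [component + rng.gauss(0.0, noise_std) for component in quat]
    norm = math.sqrt(sum(c * c for c in noisy))
    if norm == 0.0:
        # Degenerate case: fall back to the unperturbed input.
        return tuple(quat)
    return tuple(c / norm for c in noisy)
```

Without the division by the norm, downstream code that assumes unit quaternions would silently receive a scaled rotation.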
Sequence Diagram
sequenceDiagram
participant CLI
participant PolicyRunner
participant ArenaEnvBuilder
participant Env
participant Policy
participant MetricsLogger
CLI->>PolicyRunner: main()
PolicyRunner->>ArenaEnvBuilder: get_arena_builder_from_cli(args_cli)
ArenaEnvBuilder->>Env: make_registered_and_return_cfg()
PolicyRunner->>Policy: policy_cls.from_args(args_cli)
PolicyRunner->>PolicyRunner: rollout_policy(env, policy, num_steps, num_episodes)
loop Each step/episode
PolicyRunner->>Policy: get_action(env, obs)
Policy-->>PolicyRunner: actions
PolicyRunner->>Env: env.step(actions)
Env-->>PolicyRunner: obs, terminated, truncated
alt terminated or truncated
PolicyRunner->>Policy: reset(env_ids)
end
end
PolicyRunner->>Env: compute_metrics(env.unwrapped)
Env-->>PolicyRunner: metrics
PolicyRunner->>MetricsLogger: metrics_to_plain_python_types(metrics)
PolicyRunner->>CLI: print metrics
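The loop in the diagram can be sketched in plain Python as follows. The stub environment and policy are hypothetical stand-ins that only mimic the call shapes shown in the diagram (`get_action(env, obs)`, `env.step(actions)`, `policy.reset(...)`); the real `rollout_policy` takes additional arguments and works on batched environments.

```python
class StubEnv:
    """Hypothetical environment that terminates every third step."""

    def __init__(self) -> None:
        self.steps = 0

    def reset(self) -> int:
        self.steps = 0
        return self.steps

    def step(self, action: int):
        self.steps += 1
        terminated = self.steps % 3 == 0
        truncated = False
        return self.steps, terminated, truncated


class StubPolicy:
    """Hypothetical policy that always returns a zero action."""

    def __init__(self) -> None:
        self.resets = 0

    def get_action(self, env: StubEnv, obs: int) -> int:
        return 0

    def reset(self) -> None:
        self.resets += 1


def rollout_policy(env: StubEnv, policy: StubPolicy, num_steps: int) -> dict:
    """Step the env with policy actions and reset the policy when an episode ends."""
    obs = env.reset()
    episodes = 0
    for _ in range(num_steps):
        action = policy.get_action(env, obs)
        obs, terminated, truncated = env.step(action)
        if terminated or truncated:
            policy.reset()
            episodes += 1
    return {"num_episodes": episodes}
```

In the diagram, metrics are then computed from the environment and converted to plain Python types before they are printed.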
Reviews (3): Last reviewed commit: "Fix docs."
Summary
Merge main into develop