Fix cartpole-camera frame-stacking performance regression#5849
Fix cartpole-camera frame-stacking performance regression#5849jmart-nv wants to merge 10 commits into
Conversation
…isaac-sim#5574. - circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property. - observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates). - cartpole_camera_presets_env.py: same `.clone()` drop in the direct env. - delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.
Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.
Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv. Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.
The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).
- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions - normalize_image_uint8: docstring warning on the out= aliasing hazard - CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage - CircularBuffer.stacked: drop unreachable empty-buffer guard - _uint8_sum_partials_cache: store the tensor directly, not a tuple - Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.
Findings
🔵 Suggestion — source/isaaclab/isaaclab/utils/warp/ops.py:30-46
The _uint8_sum_partials_cache is a module-level dict that grows unbounded as different (shape, device) combinations are used. Consider adding a cache eviction strategy (e.g., LRU with bounded size) or documenting the expected cache growth for long-running training jobs.
🔵 Suggestion — source/isaaclab/isaaclab/utils/buffers/circular_buffer.py:174-179
The memmove-style shift loop for i in range(k - 1): self._buffer.narrow(...).copy_(...) issues k-1 sequential copy operations. For typical frame-stacking (K=2-4), this is fine. For larger K values, a batched roll-style operation might be more efficient. Document the expected K range or add a threshold-based dispatch.
🔵 Suggestion — source/isaaclab/isaaclab/utils/images.py:67-71
The PyTorch fallback path allocates a new tensor via images.float() / 255.0 before subtracting mean. For very large batch sizes, this could be optimized with an in-place subtract after the division. Not critical for current use cases.
🟡 Warning — source/isaaclab/isaaclab/utils/warp/ops.py:59-61
The _UINT8_SUM_TILE_HW = 32 tile size is hardcoded. While the comment explains the rationale, this value may not be optimal for all GPU architectures. Consider making this configurable or auto-tuning based on device properties for future performance work.
🔵 Suggestion — source/isaaclab/isaaclab/envs/mdp/observations.py:736-740
The docstring warning about out= aliasing hazard is excellent. The implementation correctly avoids the aliasing bug by not passing out= in stacked_image.__call__, allocating fresh storage per call. This design choice is well-documented in the commit message and tested.
Test Coverage
✅ Excellent coverage. The PR adds:
test_stacked_image_mdp.py: Tests for RGB buffer dtype, normalize math equivalence, and the critical consecutive-call aliasing regression test that prevents the obs overwrite bugtest_circular_buffer.py: 14 new tests coveringstack_dimmode, reset behavior, warmup, ring overflow, and index rejectiontest_delay_buffer.py: Regression test for compute result aliasingtest_images.py: Full coverage of the normalize dispatch paths and predicatestest_image_ops.py: Comprehensive Warp kernel tests including multi-tile, non-divisible H, and partials cache reuse
The aliasing regression test (test_consecutive_rgb_calls_return_independent_storage) specifically guards against the bug that was fixed in commit 2ddac7c.
CI Status
⏳ Pending — Most CI jobs are still running. Pre-commit, changelog fragments, and basic build checks have passed. The core test suites (isaaclab (core), isaaclab_tasks) are pending.
Verdict
Minor fixes needed — This is a well-architected performance fix with solid test coverage and clear documentation. The suggestions above are non-blocking improvements. The code correctly handles the aliasing hazard that was discovered during development, and the regression tests provide good guardrails. Ready to merge once CI completes successfully.
| """Whether ``data_type`` is one of the RGB-like camera outputs (rgb, albedo, simple_shading_*). | ||
|
|
||
| Args: | ||
| data_type: The camera data-type string from ``sensor.data.output`` / ``TiledCameraCfg``. |
There was a problem hiding this comment.
TiledCameraCfg is deprecated, it should be CameraCfg.
|
|
||
| Raises: | ||
| ValueError: If the buffer size is less than one. | ||
| ValueError: If the buffer size is less than one, or ``stack_dim == 0``. |
There was a problem hiding this comment.
Few lines below you throw only when is not None and stack_dim == 0. -1 means you are concatenating based on W, and to my understanding that is a legit axis to concatenate.
| # Lazily allocated on the first ``append``. | ||
| self._buffer: torch.Tensor = None # type: ignore | ||
|
|
||
| self._stack_dim_arg: int | None = stack_dim |
There was a problem hiding this comment.
It took me some time to understand stacking terminology here. Why this is called stacking? This is like gluing frames side-by-side. Initially, I was under the impression that you actually have n-buffers represent stack, but what happens here is frame concatenation based on stack dimension?
Description
Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.
The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a
CircularBufferring + astacked_imageMDP term that:x/255 - mean(x)) before the buffer, forcing the buffer to be float32.torch.rollon every append → ring-sized temporary per env.step.permute + reshape + clone→ fresh(B, H, W, K*C)float32 tensor per env.step.This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.
x/255 − meannormalize past the frame-stack buffer for RGB-like data types.CircularBuffer.stack_dimarg +.stackedproperty arrange storage so the channel-stacked output is a contiguous view of internal storage.(B, H, W, K*C)-sized allocation and a permute kernel per env.step.torch.rollon append. Replace with a front-to-back memmove.normalize_image_uint8). One kernel:(uint8 / 255) − per-(batch, channel) mean → float32.spatial_sum_uint8_tiled). Warp kernel writes(B, NUM_TILES, C)int32 partials along H; PyTorch collapses the partial dim.isaaclab.utils.images.normalize_camera_imagedispatch. One helper used by bothimage()(DirectRLEnv) andstacked_image.__call__(ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise.image(clone=False)opt-out. Callers that immediately copy into their own storage skip the defensive clone._get_observationsallocates the normalize output fresh each call (no persistentout=buffer).record_transitionreturns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.Type of change
Results
R256 (256×256), 1024 envs, PPO via skrl 2.1.0, 32k training steps, seed 42, perf measured over n=3 warm trials:
Screenshots
Environment
NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (
presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.Checklist
pre-commitchecks with./isaaclab.sh --formatconfig/extension.tomlfileCONTRIBUTORS.mdor my name already exists there