Skip to content

Fix cartpole-camera frame-stacking performance regression#5849

Draft
jmart-nv wants to merge 10 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf
Draft

Fix cartpole-camera frame-stacking performance regression#5849
jmart-nv wants to merge 10 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf

Conversation

@jmart-nv
Copy link
Copy Markdown

@jmart-nv jmart-nv commented May 28, 2026

Description

Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.

The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a CircularBuffer ring + a stacked_image MDP term that:

  1. Stored frames as float32 even when the camera output was uint8 → 4× ring memory and 4× shift bandwidth.
  2. Ran the per-frame normalize (x/255 - mean(x)) before the buffer, forcing the buffer to be float32.
  3. Re-rolled the buffer with torch.roll on every append → ring-sized temporary per env.step.
  4. Built the channel-stacked policy obs via permute + reshape + clone → fresh (B, H, W, K*C) float32 tensor per env.step.
  5. Inlined the normalize math at 5 call sites across DirectRLEnv and ManagerBasedEnv paths.

This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.

Change Why
uint8 ring buffer. Defer the x/255 − mean normalize past the frame-stack buffer for RGB-like data types. Ring is 4× smaller; shift bandwidth per env.step drops 4×.
Free contiguous stacked output. New CircularBuffer.stack_dim arg + .stacked property arrange storage so the channel-stacked output is a contiguous view of internal storage. Eliminates one (B, H, W, K*C)-sized allocation and a permute kernel per env.step.
Drop torch.roll on append. Replace with a front-to-back memmove. One fewer ring-sized allocation and one fewer kernel launch per env.step.
Fused Warp normalize kernel (normalize_image_uint8). One kernel: (uint8 / 255) − per-(batch, channel) mean → float32. Replaces upcast + reduce + subtract (three PyTorch passes) with one fused launch over the K-stacked input.
Tiled int32 partial-sum reduction kernel (spatial_sum_uint8_tiled). Warp kernel writes (B, NUM_TILES, C) int32 partials along H; PyTorch collapses the partial dim. int32 accumulator is safe up to H·W·255 (128× headroom at R256) and ~3× faster than int64 / float32 on PyTorch's auto-tuned reduce path. C-innermost launch shape gives stride-1 reads on src's contiguous trailing dim.
Shared isaaclab.utils.images.normalize_camera_image dispatch. One helper used by both image() (DirectRLEnv) and stacked_image.__call__ (ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise. Removes inline normalize math duplicated across 5 sites; the fast path picks itself up automatically.
image(clone=False) opt-out. Callers that immediately copy into their own storage skip the defensive clone. Saves one obs-sized allocation per env.step at the consumer site (frame-stack buffer).
Allocator-managed obs lifetime. _get_observations allocates the normalize output fresh each call (no persistent out= buffer). The RL trainer holds a reference to the previous-iteration observation until record_transition returns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Results

R256 (256×256), 1024 envs, PPO via skrl 2.1.0, 32k training steps, seed 42, perf measured over n=3 warm trials:

Stage FPS step_ms Peak VRAM Mean reward (32k)
Pre-framestack baseline 54.62 18.67 3.13 GB ~102 (unsolvable — no temporal info)
Frame stacking from #5574 31.72 32.07 8.37 GB 297.68
This PR 67.87 15.32 4.41 GB 297.60

Screenshots

pr_figure

Environment

NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.

Checklist

  • I have read and understood the contribution guidelines
  • I have run the pre-commit checks with ./isaaclab.sh --format
  • I have made corresponding changes to the documentation (N/A - frame stacking already documented, these changes are implementation details)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the changelog and the corresponding version in the extension's config/extension.toml file
  • I have added my name to the CONTRIBUTORS.md or my name already exists there

jmart-nv added 10 commits May 28, 2026 15:16
…isaac-sim#5574.

- circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property.
- observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates).
- cartpole_camera_presets_env.py: same `.clone()` drop in the direct env.
- delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.
Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.
Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv.

Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.
The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).
- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions
- normalize_image_uint8: docstring warning on the out= aliasing hazard
- CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage
- CircularBuffer.stacked: drop unreachable empty-buffer guard
- _uint8_sum_partials_cache: store the tensor directly, not a tuple
- Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract
@github-actions github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.

Findings

🔵 Suggestionsource/isaaclab/isaaclab/utils/warp/ops.py:30-46
The _uint8_sum_partials_cache is a module-level dict that grows unbounded as different (shape, device) combinations are used. Consider adding a cache eviction strategy (e.g., LRU with bounded size) or documenting the expected cache growth for long-running training jobs.

🔵 Suggestionsource/isaaclab/isaaclab/utils/buffers/circular_buffer.py:174-179
The memmove-style shift loop for i in range(k - 1): self._buffer.narrow(...).copy_(...) issues k-1 sequential copy operations. For typical frame-stacking (K=2-4), this is fine. For larger K values, a batched roll-style operation might be more efficient. Document the expected K range or add a threshold-based dispatch.

🔵 Suggestionsource/isaaclab/isaaclab/utils/images.py:67-71
The PyTorch fallback path allocates a new tensor via images.float() / 255.0 before subtracting mean. For very large batch sizes, this could be optimized with an in-place subtract after the division. Not critical for current use cases.

🟡 Warningsource/isaaclab/isaaclab/utils/warp/ops.py:59-61
The _UINT8_SUM_TILE_HW = 32 tile size is hardcoded. While the comment explains the rationale, this value may not be optimal for all GPU architectures. Consider making this configurable or auto-tuning based on device properties for future performance work.

🔵 Suggestionsource/isaaclab/isaaclab/envs/mdp/observations.py:736-740
The docstring warning about out= aliasing hazard is excellent. The implementation correctly avoids the aliasing bug by not passing out= in stacked_image.__call__, allocating fresh storage per call. This design choice is well-documented in the commit message and tested.

Test Coverage

Excellent coverage. The PR adds:

  • test_stacked_image_mdp.py: Tests for RGB buffer dtype, normalize math equivalence, and the critical consecutive-call aliasing regression test that prevents the obs overwrite bug
  • test_circular_buffer.py: 14 new tests covering stack_dim mode, reset behavior, warmup, ring overflow, and index rejection
  • test_delay_buffer.py: Regression test for compute result aliasing
  • test_images.py: Full coverage of the normalize dispatch paths and predicates
  • test_image_ops.py: Comprehensive Warp kernel tests including multi-tile, non-divisible H, and partials cache reuse

The aliasing regression test (test_consecutive_rgb_calls_return_independent_storage) specifically guards against the bug that was fixed in commit 2ddac7c.

CI Status

Pending — Most CI jobs are still running. Pre-commit, changelog fragments, and basic build checks have passed. The core test suites (isaaclab (core), isaaclab_tasks) are pending.

Verdict

Minor fixes needed — This is a well-architected performance fix with solid test coverage and clear documentation. The suggestions above are non-blocking improvements. The code correctly handles the aliasing hazard that was discovered during development, and the regression tests provide good guardrails. Ready to merge once CI completes successfully.

"""Whether ``data_type`` is one of the RGB-like camera outputs (rgb, albedo, simple_shading_*).

Args:
data_type: The camera data-type string from ``sensor.data.output`` / ``TiledCameraCfg``.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TiledCameraCfg is deprecated, it should be CameraCfg.


Raises:
ValueError: If the buffer size is less than one.
ValueError: If the buffer size is less than one, or ``stack_dim == 0``.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few lines below you throw only when is not None and stack_dim == 0. -1 means you are concatenating based on W, and to my understanding that is a legit axis to concatenate.

# Lazily allocated on the first ``append``.
self._buffer: torch.Tensor = None # type: ignore

self._stack_dim_arg: int | None = stack_dim
Copy link
Copy Markdown
Collaborator

@pbarejko pbarejko May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me some time to understand stacking terminology here. Why this is called stacking? This is like gluing frames side-by-side. Initially, I was under the impression that you actually have n-buffers represent stack, but what happens here is frame concatenation based on stack dimension?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

isaac-lab Related to Isaac Lab team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants