Fix cartpole-camera frame-stacking performance regression by jmart-nv · Pull Request #5849 · isaac-sim/IsaacLab

jmart-nv · 2026-05-28T21:44:39Z

Description

Frame-stacking was added to cartpole-camera by #5574 to make the task solvable on the Newton+Warp backend. Single-frame RGB observations don't carry pole velocity, so a single-image policy can only recover ~100/300 mean episode reward. On PhysX+RTX, implicit damping in the dynamics plus temporal antialiasing in the renderer leak enough adjacent-frame correlation that a single-frame policy still hits ~296. Newton+Warp has neither, so explicit frame stacking is required for the task to be solvable on this backend.

The implementation regressed throughput by 42% (54.6 → 31.7 FPS) and inflated peak VRAM 2.7× (3.13 → 8.37 GB) at R256, 1024 envs. PR #5574 added a CircularBuffer ring + a stacked_image MDP term that:

Stored frames as float32 even when the camera output was uint8 → 4× ring memory and 4× shift bandwidth.
Ran the per-frame normalize (x/255 - mean(x)) before the buffer, forcing the buffer to be float32.
Re-rolled the buffer with torch.roll on every append → ring-sized temporary per env.step.
Built the channel-stacked policy obs via permute + reshape + clone → fresh (B, H, W, K*C) float32 tensor per env.step.
Inlined the normalize math at 5 call sites across DirectRLEnv and ManagerBasedEnv paths.

This PR rewrites the frame-stacking hot path; the result runs 24% faster than the pre-framestack baseline with identical PPO convergence.

Change	Why
uint8 ring buffer. Defer the `x/255 − mean` normalize past the frame-stack buffer for RGB-like data types.	Ring is 4× smaller; shift bandwidth per env.step drops 4×.
Free contiguous stacked output. New `CircularBuffer.stack_dim` arg + `.stacked` property arrange storage so the channel-stacked output is a contiguous view of internal storage.	Eliminates one `(B, H, W, K*C)`-sized allocation and a permute kernel per env.step.
Drop `torch.roll` on append. Replace with a front-to-back memmove.	One fewer ring-sized allocation and one fewer kernel launch per env.step.
Fused Warp normalize kernel (`normalize_image_uint8`). One kernel: `(uint8 / 255) − per-(batch, channel) mean → float32`.	Replaces upcast + reduce + subtract (three PyTorch passes) with one fused launch over the K-stacked input.
Tiled int32 partial-sum reduction kernel (`spatial_sum_uint8_tiled`). Warp kernel writes `(B, NUM_TILES, C)` int32 partials along H; PyTorch collapses the partial dim.	int32 accumulator is safe up to H·W·255 (128× headroom at R256) and ~3× faster than int64 / float32 on PyTorch's auto-tuned reduce path. C-innermost launch shape gives stride-1 reads on src's contiguous trailing dim.
Shared `isaaclab.utils.images.normalize_camera_image` dispatch. One helper used by both `image()` (DirectRLEnv) and `stacked_image.__call__` (ManagerBasedEnv); routes uint8 contiguous inputs through the Warp fast path and falls back to PyTorch otherwise.	Removes inline normalize math duplicated across 5 sites; the fast path picks itself up automatically.
`image(clone=False)` opt-out. Callers that immediately copy into their own storage skip the defensive clone.	Saves one obs-sized allocation per env.step at the consumer site (frame-stack buffer).
Allocator-managed obs lifetime. `_get_observations` allocates the normalize output fresh each call (no persistent `out=` buffer).	The RL trainer holds a reference to the previous-iteration observation until `record_transition` returns; reusing one buffer across env.step boundaries aliases the trainer's prior obs. Fresh allocation lets PyTorch's caching allocator hand out a different block while the prior is still referenced.

Type of change

Bug fix (non-breaking change which fixes an issue)

Results

R256 (256×256), 1024 envs, PPO via skrl 2.1.0, 32k training steps, seed 42, perf measured over n=3 warm trials:

Stage	FPS	step_ms	Peak VRAM	Mean reward (32k)
Pre-framestack baseline	54.62	18.67	3.13 GB	~102 (unsolvable — no temporal info)
Frame stacking from #5574	31.72	32.07	8.37 GB	297.68
This PR	67.87	15.32	4.41 GB	297.60

Screenshots

Environment

NVIDIA L40 (48 GB), driver 570.158.01. Newton+Warp backend (presets=newton_mjwarp,newton_renderer). skrl 2.1.0 trainer.

Checklist

I have read and understood the contribution guidelines
I have run the pre-commit checks with ./isaaclab.sh --format
I have made corresponding changes to the documentation (N/A - frame stacking already documented, these changes are implementation details)
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I have updated the changelog and the corresponding version in the extension's config/extension.toml file
I have added my name to the CONTRIBUTORS.md or my name already exists there

…isaac-sim#5574. - circular_buffer.py: replace torch.roll with O(1) write-index append; add CPU mirror of max_len and CPU bool gate to skip GPU sync on the steady-state path; rotate on read via the `buffer` property. - observations.py (stacked_image): drop trailing `.clone()` from the permute().reshape() chain (reshape on non-contig already allocates). - cartpole_camera_presets_env.py: same `.clone()` drop in the direct env. - delay_buffer.py: same `.clone()` drop on the DelayBuffer.compute return.

…t on append

…stack consumers

…ut reshape

Skip per-frame /255 + mean-subtract on the stacking path; apply once on the stacked output. Ring buffer now holds native uint8 (4x cheaper per-step copies). Math is identical because K frames live in disjoint channel slices. Affects stacked_image MDP term and the cartpole DirectRLEnv parent (new `normalize: bool = True` kwarg) + subclass.

Replace the three-pass PyTorch normalize on the camera-observation hot path with a single fused Warp kernel + small mean reduction. Consolidate the per-data-type dispatch into a shared helper used by image(), stacked_image, CartpoleCameraEnv, and CartpoleCameraPresetsEnv. Frame-stacking subclasses pre-allocate the float32 output once and reuse it across steps, eliminating the per-step transient that the previous deferred-normalize path incurred.

The Warp kernel produces int32 partial sums along H; PyTorch collapses the partials and divides once. Channels-innermost launch shape gives coalesced reads on src's contiguous trailing dim. Partials scratch is cached per (shape, device).

- stacked_image / CartpoleCameraPresetsEnv: don't reuse a pre-allocated normalize-output buffer across env.step; the previous-iteration obs (held by the RL trainer) was overwritten before record_transition could read it, causing PPO to record degenerate transitions - normalize_image_uint8: docstring warning on the out= aliasing hazard - CircularBuffer.reset: setitem (= 0.0) instead of advanced-getitem + .zero_() so partial-batch resets actually zero the storage - CircularBuffer.stacked: drop unreachable empty-buffer guard - _uint8_sum_partials_cache: store the tensor directly, not a tuple - Tests: regressions for both bugs, image() clone kwarg, and DelayBuffer.compute() no-aliasing contract

isaaclab-review-bot

🤖 Isaac Lab Review Bot

Summary

This PR addresses a significant 42% throughput regression introduced by frame-stacking in #5574 for the cartpole-camera task. The fix implements several well-designed optimizations: (1) storing uint8 frames in the ring buffer instead of float32 (4× smaller), (2) deferring normalization past the frame-stack buffer, (3) replacing torch.roll with memmove-style shifting, (4) adding a fused Warp kernel for RGB normalize, and (5) providing a free contiguous stacked view via the new stack_dim parameter. The result is a 24% improvement over the pre-framestack baseline (67.87 FPS vs 54.62 FPS) with identical PPO convergence.

Findings

🔵 Suggestion — source/isaaclab/isaaclab/utils/warp/ops.py:30-46
The _uint8_sum_partials_cache is a module-level dict that grows unbounded as different (shape, device) combinations are used. Consider adding a cache eviction strategy (e.g., LRU with bounded size) or documenting the expected cache growth for long-running training jobs.

🔵 Suggestion — source/isaaclab/isaaclab/utils/buffers/circular_buffer.py:174-179
The memmove-style shift loop for i in range(k - 1): self._buffer.narrow(...).copy_(...) issues k-1 sequential copy operations. For typical frame-stacking (K=2-4), this is fine. For larger K values, a batched roll-style operation might be more efficient. Document the expected K range or add a threshold-based dispatch.

🔵 Suggestion — source/isaaclab/isaaclab/utils/images.py:67-71
The PyTorch fallback path allocates a new tensor via images.float() / 255.0 before subtracting mean. For very large batch sizes, this could be optimized with an in-place subtract after the division. Not critical for current use cases.

🟡 Warning — source/isaaclab/isaaclab/utils/warp/ops.py:59-61
The _UINT8_SUM_TILE_HW = 32 tile size is hardcoded. While the comment explains the rationale, this value may not be optimal for all GPU architectures. Consider making this configurable or auto-tuning based on device properties for future performance work.

🔵 Suggestion — source/isaaclab/isaaclab/envs/mdp/observations.py:736-740
The docstring warning about out= aliasing hazard is excellent. The implementation correctly avoids the aliasing bug by not passing out= in stacked_image.__call__, allocating fresh storage per call. This design choice is well-documented in the commit message and tested.

Test Coverage

✅ Excellent coverage. The PR adds:

test_stacked_image_mdp.py: Tests for RGB buffer dtype, normalize math equivalence, and the critical consecutive-call aliasing regression test that prevents the obs overwrite bug
test_circular_buffer.py: 14 new tests covering stack_dim mode, reset behavior, warmup, ring overflow, and index rejection
test_delay_buffer.py: Regression test for compute result aliasing
test_images.py: Full coverage of the normalize dispatch paths and predicates
test_image_ops.py: Comprehensive Warp kernel tests including multi-tile, non-divisible H, and partials cache reuse

The aliasing regression test (test_consecutive_rgb_calls_return_independent_storage) specifically guards against the bug that was fixed in commit 2ddac7c.

CI Status

⏳ Pending — Most CI jobs are still running. Pre-commit, changelog fragments, and basic build checks have passed. The core test suites (isaaclab (core), isaaclab_tasks) are pending.

Verdict

Minor fixes needed — This is a well-architected performance fix with solid test coverage and clear documentation. The suggestions above are non-blocking improvements. The code correctly handles the aliasing hazard that was discovered during development, and the regression tests provide good guardrails. Ready to merge once CI completes successfully.

pbarejko · 2026-05-28T23:36:57Z

+    """Whether ``data_type`` is one of the RGB-like camera outputs (rgb, albedo, simple_shading_*).
+
+    Args:
+        data_type: The camera data-type string from ``sensor.data.output`` / ``TiledCameraCfg``.


TiledCameraCfg is deprecated, it should be CameraCfg.

pbarejko · 2026-05-29T00:03:57Z


        Raises:
-            ValueError: If the buffer size is less than one.
+            ValueError: If the buffer size is less than one, or ``stack_dim == 0``.


Few lines below you throw only when is not None and stack_dim == 0. -1 means you are concatenating based on W, and to my understanding that is a legit axis to concatenate.

pbarejko · 2026-05-29T21:21:22Z

+        # Lazily allocated on the first ``append``.
        self._buffer: torch.Tensor = None  # type: ignore

+        self._stack_dim_arg: int | None = stack_dim


It took me some time to understand stacking terminology here. Why this is called stacking? This is like gluing frames side-by-side. Initially, I was under the impression that you actually have n-buffers represent stack, but what happens here is frame concatenation based on stack dimension?

jmart-nv added 10 commits May 28, 2026 15:16

Drop read-side torch.roll from CircularBuffer.buffer via memmove-shif…

32024a4

…t on append

Add CircularBuffer.copy_to_stacked and clone=False opt-out for frame-…

347f976

…stack consumers

Add stack_dim mode to CircularBuffer for free contiguous stacked-outp…

2dad31b

…ut reshape

Changelog fragments for framestack performance regression fix

8222de6

Restore original comments that were accidentally trimmed

c554513

github-actions Bot added the isaac-lab Related to Isaac Lab team label May 28, 2026

isaaclab-review-bot Bot reviewed May 28, 2026

View reviewed changes

pbarejko reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix cartpole-camera frame-stacking performance regression#5849

Fix cartpole-camera frame-stacking performance regression#5849
jmart-nv wants to merge 10 commits into
isaac-sim:developfrom
jmart-nv:jmart/framestack-perf

jmart-nv commented May 28, 2026 •

edited

Loading

Uh oh!

isaaclab-review-bot Bot left a comment

Uh oh!

pbarejko May 28, 2026

Uh oh!

pbarejko May 29, 2026

Uh oh!

pbarejko May 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jmart-nv commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of change

Results

Screenshots

Environment

Checklist

Uh oh!

isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

🤖 Isaac Lab Review Bot

Summary

Findings

Test Coverage

CI Status

Verdict

Uh oh!

pbarejko May 28, 2026

Choose a reason for hiding this comment

Uh oh!

pbarejko May 29, 2026

Choose a reason for hiding this comment

Uh oh!

pbarejko May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jmart-nv commented May 28, 2026 •

edited

Loading

pbarejko May 29, 2026 •

edited

Loading