[codex] Add Diff-MPPI gradient horizon benchmarks by rsasaki0109 · Pull Request #3 · rsasaki0109/CudaRobotics

rsasaki0109 · 2026-05-17T21:26:39Z

Summary

Add grad_update_horizon to PlannerVariant, restricting Diff-MPPI's gradient step to the first N steps of the horizon (overridable with --override-grad-update-horizon).
Add five variants, diff_mppi_3_early{1,2,4,8,16}, plus a strong-feedback baseline, feedback_mppi_strong (feedback_mode = 10).
--override-dyn-speed-scale / --override-dyn-radius-scale multiply the speed and radius of dynamic obstacles at runtime.
New scenario dynamic_pincer: a hard regime in which three dynamic obstacles converge on the midpoint (~25, 25) of the agent's path.
Sweep tooling: sweep_grad_horizon_difficulty.py / sweep_k_vs_horizon.py plus two analyzers (scenario names and take-aways are data-driven).
Add three presets to tune_diff_mppi_time_targets.py.
core/ HorizonSelector contract (HorizonSelectionRequest / HorizonRecommendation / HorizonSelector Protocol).
experiments/horizon_selection/: MinimalSufficientHorizonSelector + difficulty_index (NN lookup and robust k-NN lookup over the sweep CSV).
Three end-to-end drivers: recommend_horizon / auto_benchmark_with_recommended_horizon / online_horizon_generalization_test (--robust flag).
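
The grad_update_horizon restriction described in the first item can be pictured with a small Python sketch. It is not the CUDA implementation: the function name, the plain-list control sequence, and the convention that a horizon of 0 means "update every step" are assumptions made for illustration.

```python
def apply_early_gradient_step(controls, grads, step_size, grad_update_horizon):
    """Gradient-descend only the first `grad_update_horizon` controls.

    Assumption: a horizon of 0 (or less) means "update the full sequence".
    """
    total = len(controls)
    if grad_update_horizon <= 0:
        n = total
    else:
        n = min(grad_update_horizon, total)

    updated = list(controls)  # leave the caller's sequence untouched
    for t in range(n):
        updated[t] = controls[t] - step_size * grads[t]
    return updated


controls = [1.0, 1.0, 1.0, 1.0, 1.0]
grads = [1.0, 1.0, 1.0, 1.0, 1.0]
# Only the first two steps move; the tail keeps its sampled value.
print(apply_early_gradient_step(controls, grads, 0.5, 2))
# -> [0.5, 0.5, 1.0, 1.0, 1.0]
```

In this picture, early8 corresponds to grad_update_horizon=8 on a 30-step horizon.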

Difficulty sweep: success counts across 3 scenarios (out of 18 cells)

planner	crossing	slalom	pincer
mppi	0	0	0
early1	0	0	1
early2	15	0	6
early4	15	12	10
early8	15	15	10
early16	15	14	10
full (30 steps)	15	13	10

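crossing: a single dynamic obstacle → early2 already matches full (15 cells each).
slalom: six static obstacles plus a descending dynamic obstacle → early2 scores 0/18; early8 is best (15/18).
pincer: three dynamic obstacles converge on the center → the easy regime is only a narrow band around speed=0.5.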

Best horizon by final_distance in the hard regime (cells where no planner reaches the goal):

scenario	hard cells	best-horizon breakdown
crossing	3	early8=3
slalom	3	early8=1, early16=1, full=1
pincer	8	early8=5, full=2, early4=1

→ early8 dominates across scenarios (best in 9 of the 14 hard cells).

K × horizon: substitutability (pincer 1.5/1.3 hard cell, values are final_distance)

planner	h	K=512	K=1024	K=2048	K=4096	K=8192
early1	1	35.44	35.34	35.37	35.42	35.37
early2	2	27.54	15.90	29.45	34.42	34.34
early4	4	5.68	5.96	5.74	5.72	5.86
early8	8	4.67	4.54	4.61	4.57	4.60
early16	16	7.56	7.45	7.81	7.17	7.39
full	30	10.80	10.41	10.46	10.66	10.57

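K cannot substitute for horizon: early1/early2 do not catch up even at K=8192.
There is a regime where a longer horizon actually hurts: full's final_d is 2-2.5x that of the recommended horizon.
avg_control_ms is dominated by K → when compute-bound, reducing K is the right lever; horizon=8 is effectively free.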
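
The substitution question ("does raising K let a planner reach the overall best final_distance?") can be checked directly against the table above. The sketch below hard-codes the table values; the function name and the 0.5-unit tolerance for "reaches the best" are assumptions for illustration, not the logic of analyze_k_vs_horizon.py.

```python
K_VALUES = [512, 1024, 2048, 4096, 8192]

# final_distance per planner, one entry per K in K_VALUES (from the table above).
FINAL_D = {
    "early1": [35.44, 35.34, 35.37, 35.42, 35.37],
    "early2": [27.54, 15.90, 29.45, 34.42, 34.34],
    "early4": [5.68, 5.96, 5.74, 5.72, 5.86],
    "early8": [4.67, 4.54, 4.61, 4.57, 4.60],
    "early16": [7.56, 7.45, 7.81, 7.17, 7.39],
    "full": [10.80, 10.41, 10.46, 10.66, 10.57],
}


def substitution_table(final_d, tolerance=0.5):
    """For each planner, report its best final_d over all K and whether that
    comes within `tolerance` (an assumed threshold) of the overall best."""
    overall_best = min(min(values) for values in final_d.values())
    table = {}
    for planner, values in final_d.items():
        best = min(values)
        table[planner] = {
            "best_final_d": best,
            "best_K": K_VALUES[values.index(best)],
            "gap_to_overall_best": round(best - overall_best, 2),
            "reaches_best": best - overall_best <= tolerance,
        }
    return overall_best, table


overall_best, table = substitution_table(FINAL_D)
for planner, row in table.items():
    print(planner, row)
```

With these numbers only early8 is flagged as reaching the best value, and the best K for each planner is not the largest one, which is consistent with K not acting as a substitute for horizon.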

Online generalization: minimal vs robust k-NN (k=3)

Minimal-sufficient and robust k-NN compared on 12 off-grid probe cells (3 scenarios × 4):

probe	minimal: planner / gap	robust: planner / gap	improved?
crossing (+0.25, 1.00)	early2 / -0.03	early2 / -0.03	same
crossing (+1.25, 1.00)	early2 / -0.01	early8 / -0.11	robust better
crossing (+1.75, 1.00)	early8 / +0.00	early8 / +0.00	same
crossing (+0.50, 1.15)	early2 / -0.03	early2 / -0.03	same
slalom (+0.25, 1.00)	early4 / -0.12	early4 / -0.12	same
slalom (+1.25, 1.00)	early4 / +0.04	early8 / +0.04	same gap, robust heavier
slalom (+1.75, 1.00)	early8 / -0.28	early16 / -0.14	minimal slightly better
slalom (+0.50, 1.45)	early4 / -0.04	early4 / -0.04	same
pincer (+0.25, 1.00)	early2 / -0.04	early2 / -0.04	same
pincer (+0.75, 1.00)	early1 / +0.18 (FAIL)	early4 / -0.08	fixed by robust
pincer (+1.25, 1.30)	early4 / -3.76	early8 / -3.40	minimal slightly better
pincer (+1.75, 1.30)	early8 / -0.28	early8 / -0.28	same

Results

In robust mode no cell exceeds a gap of +0.05 (minimal has one cell at +0.18).
The only minimal miss (pincer 0.75/1.0) is fixed: minimal was swayed by early1 having succeeded only in the corner cell (0.5, 1.0) and recommended early1 → fail. Robust takes the max over the 3 nearest cells (early1, early4, early2) and picks early4 → success.
Trade-off: in 2 cells robust is slightly worse than minimal (slalom 1.75/1.0: gap -0.28 → -0.14; pincer 1.25/1.3: -3.76 → -3.40). Taking the max horizon is occasionally excessive, but eliminating failure cells matters more for deployment.
Apart from that single minimal miss, both modes keep the gap at or below +0.05 (i.e. recommended is on par with or better than full), which follows from the 8-step horizon being a cross-scenario sweet spot.
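
The counts quoted in these results can be reproduced from the probe table. The sketch below assumes gap = final_distance(recommended) - final_distance(full), which matches the 1.85 vs 1.93 example given in the commit notes; the tuple layout and helper name are illustrative only.

```python
# (scenario, speed, radius, minimal_gap, robust_gap), copied from the probe table.
PROBES = [
    ("crossing", 0.25, 1.00, -0.03, -0.03),
    ("crossing", 1.25, 1.00, -0.01, -0.11),
    ("crossing", 1.75, 1.00, 0.00, 0.00),
    ("crossing", 0.50, 1.15, -0.03, -0.03),
    ("slalom", 0.25, 1.00, -0.12, -0.12),
    ("slalom", 1.25, 1.00, 0.04, 0.04),
    ("slalom", 1.75, 1.00, -0.28, -0.14),
    ("slalom", 0.50, 1.45, -0.04, -0.04),
    ("pincer", 0.25, 1.00, -0.04, -0.04),
    ("pincer", 0.75, 1.00, 0.18, -0.08),
    ("pincer", 1.25, 1.30, -3.76, -3.40),
    ("pincer", 1.75, 1.30, -0.28, -0.28),
]

GAP_LIMIT = 0.05  # a gap above this counts as "worse than always-full"


def summarize(probes, limit=GAP_LIMIT):
    """Count threshold violations per mode and where the two modes differ."""
    return {
        "minimal_over_limit": sum(1 for p in probes if p[3] > limit),
        "robust_over_limit": sum(1 for p in probes if p[4] > limit),
        "robust_better": sum(1 for p in probes if p[4] < p[3]),
        "robust_worse": sum(1 for p in probes if p[4] > p[3]),
    }


print(summarize(PROBES))
# -> {'minimal_over_limit': 1, 'robust_over_limit': 0, 'robust_better': 2, 'robust_worse': 2}
```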

Conclusions

Horizon is a function of scenario topology: 2 for crossing, 4-8 for slalom/pincer; 8 steps is the cross-scenario sweet spot.
K does not substitute for horizon: the compute budget should go toward getting the horizon right.
NN lookup plus robust k-NN selection generalizes to (speed, radius) values outside the sweep: minimal-sufficient gets 11/12 success; robust k-NN succeeds or ties on 12/12.
With the core contract in place, per-scenario planner selection can be driven from data (the sweep summary), and the two policies, minimal and robust, can be switched.

Test plan

cmake --build build --target benchmark_diff_mppi builds successfully
python3 scripts/sweep_grad_horizon_difficulty.py --scenarios dynamic_crossing --seeds 4 completes in about 3 minutes
python3 scripts/sweep_k_vs_horizon.py --scenario dynamic_pincer --speed-scale 1.5 --radius-scale 1.3 --seeds 4 completes in about 35 seconds
python3 scripts/online_horizon_generalization_test.py (minimal) completes in about 40 seconds
python3 scripts/online_horizon_generalization_test.py --robust (k-NN, k=3) completes in about 45 seconds
python3 scripts/recommend_horizon.py --summary-csv build/sweep_grad_horizon_difficulty_summary.csv prints a per-cell recommendation table
python3 scripts/auto_benchmark_with_recommended_horizon.py --summary-csv build/sweep_grad_horizon_difficulty_summary.csv --verify --cells "dynamic_crossing,1.5,1.0" runs through to the re-measurement step
python3 -c "from experiments.horizon_selection.difficulty_index import recommend_for_probe_robust; print('ok')" exits without error

benchmark_diff_mppi gains --override-dyn-speed-scale and --override-dyn-radius-scale so a single scenario (dynamic_crossing / dynamic_slalom) can be re-used as a difficulty axis without adding new scene definitions. scripts/sweep_grad_horizon_difficulty.py drives the full grid (speed_scale, radius_scale) x (mppi, diff_mppi_3_early{1,2,4,8,16}, diff_mppi_3) and aggregates per-cell success rate, final distance, cumulative cost, collisions and avg control latency. scripts/analyze_grad_horizon_sweep.py turns the summary CSV into a Markdown report with success-rate / final-distance grids and a regime classification (easy / needs e4+ / all-fail).
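
The per-cell aggregation step can be sketched as below. This is a minimal illustration, not the code in sweep_grad_horizon_difficulty.py: the dictionary keys used for each per-seed run and the function name are hypothetical, and only three of the aggregated metrics are shown.

```python
from collections import defaultdict
from statistics import mean


def aggregate_runs(runs):
    """Group per-seed runs by (scenario, speed_scale, radius_scale, planner)
    and average the metrics within each group.

    `runs` is a list of dicts with hypothetical keys; the real CSV columns
    may be named differently.
    """
    grouped = defaultdict(list)
    for run in runs:
        key = (run["scenario"], run["speed_scale"], run["radius_scale"], run["planner"])
        grouped[key].append(run)

    summary = {}
    for key, group in grouped.items():
        summary[key] = {
            "seeds": len(group),
            "success_rate": mean(1.0 if r["success"] else 0.0 for r in group),
            "final_distance": mean(r["final_distance"] for r in group),
            "avg_control_ms": mean(r["avg_control_ms"] for r in group),
        }
    return summary


# Two made-up seeds for one cell, purely to show the shape of the output.
runs = [
    {"scenario": "dynamic_crossing", "speed_scale": 1.5, "radius_scale": 1.0,
     "planner": "diff_mppi_3_early8", "success": True,
     "final_distance": 1.0, "avg_control_ms": 2.0},
    {"scenario": "dynamic_crossing", "speed_scale": 1.5, "radius_scale": 1.0,
     "planner": "diff_mppi_3_early8", "success": False,
     "final_distance": 3.0, "avg_control_ms": 4.0},
]
print(aggregate_runs(runs))
```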

dynamic_pincer adds three dyn obstacles whose trajectories converge on the corridor midpoint (~25,25), exercising the "agent meets multiple moving obstacles in the same window" regime that single-obstacle crossing/slalom never reaches. analyze_grad_horizon_sweep.py now reads the scenario name from the summary CSV (so reports for crossing/slalom/pincer label themselves correctly) and replaces the hard-coded take-aways with data-driven counters: per-planner success cells, share of full-horizon cells that early2 also covers, and per-planner ranking in the all-fail regime.

core/horizon_selector_interface.py mirrors the planner-selector pattern:
- HorizonSelectionRequest: dataset, scenario, success_threshold, prefer_minimal, fallback_metric
- HorizonRecommendation: planner, grad_update_horizon, success_rate, final_distance, rationale
- HorizonSelector Protocol with recommend(rows, request)

experiments/horizon_selection/ provides:
- horizon_naming.parse_grad_update_horizon: pulls the integer horizon out of "diff_mppi_3_early8" -> 8, "diff_mppi_3" -> 0 (full sentinel).
- MinimalSufficientHorizonSelector: picks the smallest horizon meeting the success threshold; if none does, falls back to the requested metric (final_distance by default).

scripts/recommend_horizon.py is a thin driver that consumes the sweep summary CSV (sweep_grad_horizon_difficulty.py output), reshapes each (scenario, speed_scale, radius_scale) cell as a synthetic dataset, and prints a Markdown recommendation table per cell. Running it against the existing 3-scenario sweep reproduces the sweep-level findings:
- dynamic_crossing easy cells -> early2 (smallest sufficient)
- dynamic_slalom -> early4 in easy cells, early8 in the only speed=1.5x cell that still succeeds, full in the all-fail tail
- dynamic_pincer -> early1/early2 in the narrow easy band, early4 in the speed=1.0 cells, early8 in the all-fail hard regime
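
The contract and the minimal-sufficient rule described above can be sketched as follows. Field names come from the commit message; everything else is an assumption for illustration: the field types and defaults, the SweepRow stand-in for an aggregated row, the class name MinimalSufficientSketch, the restriction of candidates to diff_mppi planners, and the choice to rank the full sentinel (0) as the longest horizon.

```python
import re
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class HorizonSelectionRequest:
    dataset: str
    scenario: str
    success_threshold: float = 1.0  # assumed default
    prefer_minimal: bool = True  # carried for completeness; unused in this sketch
    fallback_metric: str = "final_distance"


@dataclass(frozen=True)
class HorizonRecommendation:
    planner: str
    grad_update_horizon: int
    success_rate: float
    final_distance: float
    rationale: str


@dataclass(frozen=True)
class SweepRow:  # hypothetical stand-in for one aggregated sweep row
    planner: str
    success_rate: float
    final_distance: float


class HorizonSelector(Protocol):
    def recommend(self, rows: Sequence[SweepRow],
                  request: HorizonSelectionRequest) -> HorizonRecommendation: ...


def parse_grad_update_horizon(planner: str) -> int:
    """'diff_mppi_3_early8' -> 8; names without an _earlyN suffix -> 0 (full)."""
    match = re.search(r"_early(\d+)$", planner)
    return int(match.group(1)) if match else 0


class MinimalSufficientSketch:
    def recommend(self, rows, request):
        # Assumption: only Diff-MPPI variants are candidates (plain mppi is skipped).
        candidates = [r for r in rows if r.planner.startswith("diff_mppi")]

        def horizon_rank(row):
            horizon = parse_grad_update_horizon(row.planner)
            return horizon if horizon > 0 else float("inf")  # full sorts last

        sufficient = [r for r in candidates
                      if r.success_rate >= request.success_threshold]
        if sufficient:
            chosen = min(sufficient, key=horizon_rank)
            rationale = "smallest horizon meeting the success threshold"
        else:
            # Fallback assumes the metric is "lower is better" (true for final_distance).
            chosen = min(candidates, key=lambda r: getattr(r, request.fallback_metric))
            rationale = f"no planner met the threshold; best {request.fallback_metric}"
        return HorizonRecommendation(
            planner=chosen.planner,
            grad_update_horizon=parse_grad_update_horizon(chosen.planner),
            success_rate=chosen.success_rate,
            final_distance=chosen.final_distance,
            rationale=rationale,
        )
```

Because HorizonSelector is a typing.Protocol, any class with a matching recommend method satisfies it structurally, so a minimal and a robust policy can be swapped without sharing a base class.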

scripts/auto_benchmark_with_recommended_horizon.py reads a sweep summary CSV, asks MinimalSufficientHorizonSelector for a per-cell recommendation, and emits a comparison table against the always-full (diff_mppi_3) baseline pulled from the same CSV. Default mode is "dry run": the comparison uses sweep numbers directly, so the script needs no CUDA. --verify re-runs benchmark_diff_mppi for the recommended planner + the always-full baseline on listed cells (--cells "scenario,speed,radius;..."), so the recommendation can be cross-checked end-to-end. This closes the loop: the core HorizonSelector contract now drives an end-to-end benchmark workflow, and the comparison table makes the "recommended vs always-full" story explicit (cells where shorter horizon matches full's quality, and cells where it actually beats full because longer windows accumulate stale-gradient noise).

scripts/sweep_k_vs_horizon.py runs benchmark_diff_mppi across K x planner for a fixed (scenario, speed_scale, radius_scale) cell, so the K and horizon axes can be compared head-to-head. scripts/analyze_k_vs_horizon.py renders Markdown grids of success rate, final_distance and avg_control_ms over (K, planner), plus a "substitution" table that asks per planner: does increasing K reduce final_distance, and can the planner reach the overall best? On the pincer 1.5/1.3 hard cell, the answer is sharply negative: only early8 reaches the best final_d, and even at K=8192 the shorter (early2) and longer (full) horizons remain ~4-6 units behind. The crossing 1.5/1.0 cell is more permissive but still shows K cannot rescue early1/early2. avg_control_ms scales with K and is largely flat in horizon, so the compute lever is K, not horizon.

experiments/horizon_selection/difficulty_index.py:
- load_indexed_rows: read a sweep summary CSV as AggregateBenchmarkRows tagged with synthetic dataset labels.
- nearest_cell: Euclidean lookup over (speed_scale, radius_scale) within a scenario.
- recommend_for_probe: combine the NN lookup with the HorizonSelector so the selector can be applied to (speed, radius) configurations that were not in the sweep grid.

scripts/online_horizon_generalization_test.py picks 4 off-grid probe cells per scenario (placed between sweep points so the NN distance is nonzero but small), asks the selector for a recommendation via the NN lookup, runs benchmark_diff_mppi with that recommendation plus the always-full baseline, and reports the gap. In 11 of 12 probes the recommended planner matches or beats full on final_distance; the one miss is pincer at speed=0.75 / radius=1.0, where the matched cell at speed=0.5 had early1 succeed and that recommendation did not extrapolate. This is an honest signal that minimal-sufficient is sensitive to corner cases in the easy regime and motivates a more conservative lookup (e.g. agreement of the k nearest cells) as future work.
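
The nearest-cell lookup described above amounts to a Euclidean nearest-neighbour search restricted to one scenario. The sketch below is an illustration under assumptions: the tuple layout, the function signature, and the grid values are hypothetical, and the real nearest_cell may differ.

```python
import math


def nearest_cell(cells, scenario, speed_scale, radius_scale):
    """Return the sweep cell of `scenario` closest to the probe in
    (speed_scale, radius_scale) space.

    `cells` is a list of (scenario, speed_scale, radius_scale) tuples.
    Exact ties are resolved by list order in this sketch.
    """
    candidates = [c for c in cells if c[0] == scenario]
    if not candidates:
        raise ValueError(f"no sweep cells for scenario {scenario!r}")
    return min(
        candidates,
        key=lambda c: math.hypot(c[1] - speed_scale, c[2] - radius_scale),
    )


# Hypothetical grid; the real sweep grid is defined by the sweep script.
grid = [
    ("dynamic_pincer", 0.5, 1.0),
    ("dynamic_pincer", 1.0, 1.0),
    ("dynamic_pincer", 0.5, 1.3),
    ("dynamic_crossing", 0.7, 1.1),
]
print(nearest_cell(grid, "dynamic_pincer", 0.7, 1.1))
# -> ('dynamic_pincer', 0.5, 1.0)
```

The dynamic_crossing cell sits exactly on the probe but is ignored, since the lookup never crosses scenarios.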

recommend_for_probe_robust polls the k nearest sweep cells and returns the recommendation with the maximum gradient-update horizon across them (ties broken by smallest distance). This hedges minimal-sufficient against single-cell corner cases where a tiny horizon happens to meet the success threshold in one sweep cell but does not generalise to probes between that cell and tougher neighbours. online_horizon_generalization_test.py now takes --robust / --k flags. With --robust the pincer (0.75, 1.0) miss is fixed: minimal picked early1 (the only cell where early1 succeeded was the corner case at 0.5/1.0) and failed at the probe; robust polls the 3 nearest cells, gets early1/early4/early2, and picks early4 -- which succeeds with final_d 1.85 vs the always-full baseline's 1.93. The trade-off is a slight regression on slalom (1.75, 1.0) where the larger horizon (early16 vs early8) costs ~0.14 in final_d. Net: robust has no probe with gap > +0.05 (minimal had one at +0.18).
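
The robust rule (poll the k nearest cells, take the maximum horizon, break ties by smallest distance) can be sketched as below. The data layout, the function name, the cell coordinates, and the choice to treat the full sentinel (0) as a 30-step horizon are assumptions for illustration, not the actual recommend_for_probe_robust.

```python
import math


def recommend_robust_sketch(cell_recs, probe, k=3, full_horizon=30):
    """Pick the largest-horizon recommendation among the k nearest cells.

    `cell_recs` is a list of ((speed, radius), planner, horizon) tuples, where
    horizon 0 is the "full" sentinel and is ranked as `full_horizon` steps.
    """
    def distance(rec):
        return math.dist(rec[0], probe)

    def effective_horizon(rec):
        return rec[2] if rec[2] > 0 else full_horizon

    nearest = sorted(cell_recs, key=distance)[:k]
    # Max horizon wins; among equal horizons the closer cell wins.
    return max(nearest, key=lambda rec: (effective_horizon(rec), -distance(rec)))


# Hypothetical pincer neighbourhood around the (0.75, 1.0) probe.
cell_recs = [
    ((0.5, 1.0), "diff_mppi_3_early1", 1),
    ((1.0, 1.0), "diff_mppi_3_early4", 4),
    ((0.75, 1.3), "diff_mppi_3_early2", 2),
    ((1.5, 1.3), "diff_mppi_3_early8", 8),
]
print(recommend_robust_sketch(cell_recs, (0.75, 1.0)))
# -> ((1.0, 1.0), 'diff_mppi_3_early4', 4)
```

The far early8 cell falls outside the three nearest neighbours, so the result is early4 rather than early8, mirroring the early1/early4/early2 poll reported for this probe.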

rsasaki0109 and others added 8 commits May 18, 2026 05:41

Add Diff-MPPI gradient horizon benchmarks

a9cfb77

rsasaki0109 marked this pull request as ready for review May 18, 2026 01:28

rsasaki0109 merged commit 553e057 into master May 18, 2026
1 check failed

rsasaki0109 merged 8 commits into master from codex/grad-horizon-diff-mppi
