
[codex] Add Diff-MPPI gradient horizon benchmarks #3

Merged
rsasaki0109 merged 8 commits into master from codex/grad-horizon-diff-mppi
May 18, 2026
rsasaki0109 (Owner) commented May 17, 2026

Summary

  • Adds grad_update_horizon to PlannerVariant, restricting the Diff-MPPI gradient step to the first N steps only (can be overridden with --override-grad-update-horizon).
  • Adds five variants, diff_mppi_3_early{1,2,4,8,16}, plus a strong-feedback baseline feedback_mppi_strong (feedback_mode = 10).
  • --override-dyn-speed-scale / --override-dyn-radius-scale multiply the speed and radius of dynamic obstacles at runtime.
  • New scenario dynamic_pincer: a hard regime in which three dynamic obstacles converge on the midpoint of the agent's path (~25, 25).
  • Sweep tooling: sweep_grad_horizon_difficulty.py / sweep_k_vs_horizon.py plus two analyzers (scenario names and take-aways are data-driven).
  • Adds three presets to tune_diff_mppi_time_targets.py.
  • core/ HorizonSelector contract (HorizonSelectionRequest / HorizonRecommendation / HorizonSelector Protocol).
  • experiments/horizon_selection/: MinimalSufficientHorizonSelector + difficulty_index (NN lookup and robust k-NN lookup over the sweep CSV).
  • Three end-to-end drivers: recommend_horizon / auto_benchmark_with_recommended_horizon / online_horizon_generalization_test (--robust flag).
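As a rough illustration of the first bullet, the sketch below restricts a gradient refinement to the first N steps of a control sequence. It is not the planner's actual code: the function name `apply_gradient_step`, the list-of-lists layout, and the convention that a non-positive horizon means "update every step" are all assumptions made for this example.

```python
def apply_gradient_step(controls, grad, step_size, grad_update_horizon):
    """Refine only the first `grad_update_horizon` steps of a control sequence.

    controls, grad: lists of per-step control vectors (same shape).
    A non-positive grad_update_horizon is treated as "update all steps"
    (an assumption of this sketch, not a documented planner rule).
    """
    total_steps = len(controls)
    if grad_update_horizon <= 0:
        n = total_steps
    else:
        n = min(grad_update_horizon, total_steps)

    refined = [list(u) for u in controls]  # copy so the input stays untouched
    for t in range(n):
        refined[t] = [u - step_size * g for u, g in zip(controls[t], grad[t])]
    return refined


# 30-step horizon with 2-D controls, refined over the first 8 steps only.
controls = [[0.0, 0.0] for _ in range(30)]
grad = [[1.0, 1.0] for _ in range(30)]
refined = apply_gradient_step(controls, grad, step_size=0.1, grad_update_horizon=8)
changed_steps = sum(1 for before, after in zip(controls, refined) if before != after)
print(changed_steps)
```

Steps beyond the cut-off keep whatever the sampling stage produced, which is the behaviour the early{1,2,4,8,16} variants are described as comparing.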

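Difficulty sweep: number of successful cells in three scenarios (out of 18 cells)

| planner | crossing | slalom | pincer |
|---|---|---|---|
| mppi | 0 | 0 | 0 |
| early1 | 0 | 0 | 1 |
| early2 | 15 | 0 | 6 |
| early4 | 15 | 12 | 10 |
| early8 | 15 | 15 | 10 |
| early16 | 15 | 14 | 10 |
| full (30 step) | 15 | 13 | 10 |

  • crossing: a single dynamic obstacle → early2 is on par with full (15/15).
  • slalom: six static obstacles plus a descending dynamic obstacle → early2 succeeds in 0/18 cells, and early8 is best (15/18).
  • pincer: three dynamic obstacles converge on the center → the easy regime is only a narrow band around speed=0.5.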

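Horizon with the best final_distance in the hard regime (no planner reaches the goal):

| scenario | hard cells | best horizon breakdown |
|---|---|---|
| crossing | 3 | early8=3 |
| slalom | 3 | early8=1, early16=1, full=1 |
| pincer | 8 | early8=5, full=2, early4=1 |

early8 is dominant across the scenarios (best in 9 of the 14 hard cells).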
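The tally behind that statement can be checked directly from the breakdown table. The snippet below is only arithmetic on the numbers already reported; the dictionary layout is an illustrative choice.

```python
from collections import Counter

# "Best horizon" counts per scenario, copied from the hard-regime table.
best_by_scenario = {
    "crossing": {"early8": 3},
    "slalom": {"early8": 1, "early16": 1, "full": 1},
    "pincer": {"early8": 5, "full": 2, "early4": 1},
}

totals = Counter()
for breakdown in best_by_scenario.values():
    totals.update(breakdown)

hard_cells = sum(totals.values())
for planner, count in totals.most_common():
    print(f"{planner}: best in {count}/{hard_cells} hard cells")
```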

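K × horizon: substitutability (pincer 1.5/1.3 hard cell, final_distance)

| planner | h | K=512 | K=1024 | K=2048 | K=4096 | K=8192 |
|---|---|---|---|---|---|---|
| early1 | 1 | 35.44 | 35.34 | 35.37 | 35.42 | 35.37 |
| early2 | 2 | 27.54 | 15.90 | 29.45 | 34.42 | 34.34 |
| early4 | 4 | 5.68 | 5.96 | 5.74 | 5.72 | 5.86 |
| early8 | 8 | 4.67 | 4.54 | 4.61 | 4.57 | 4.60 |
| early16 | 16 | 7.56 | 7.45 | 7.81 | 7.17 | 7.39 |
| full | 30 | 10.80 | 10.41 | 10.46 | 10.66 | 10.57 |

  • K cannot substitute for horizon: early1/early2 do not catch up even at K=8192.
  • A longer horizon actually hurts in this regime: the final_d of full is 2-2.5x that of the recommended horizon.
  • avg_control_ms is dominated by K → when compute-bound, lowering K is the right move; horizon=8 is effectively free.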
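The first two bullets can be re-derived from the table itself. The following check uses only the reported final_distance values; the variable names are illustrative.

```python
K_VALUES = [512, 1024, 2048, 4096, 8192]

# final_distance per planner, one entry per K, copied from the table above.
FINAL_D = {
    "early1": [35.44, 35.34, 35.37, 35.42, 35.37],
    "early2": [27.54, 15.90, 29.45, 34.42, 34.34],
    "early4": [5.68, 5.96, 5.74, 5.72, 5.86],
    "early8": [4.67, 4.54, 4.61, 4.57, 4.60],
    "early16": [7.56, 7.45, 7.81, 7.17, 7.39],
    "full": [10.80, 10.41, 10.46, 10.66, 10.57],
}

# Best value each planner reaches when K is allowed to vary freely.
best_over_k = {planner: min(values) for planner, values in FINAL_D.items()}
overall_best = min(best_over_k.values())

for planner, best in best_over_k.items():
    print(f"{planner}: best {best:.2f}, behind overall best by {best - overall_best:.2f}")

# Ratio of full to early8 at each K (the "2-2.5x" claim).
ratios = [full / e8 for full, e8 in zip(FINAL_D["full"], FINAL_D["early8"])]
print([round(r, 2) for r in ratios])

# Even the worst early8 entry beats every other planner's best entry.
worst_early8 = max(FINAL_D["early8"])
others_best = min(v for p, v in best_over_k.items() if p != "early8")
print(worst_early8 < others_best)
```

In this cell no choice of K lets another planner overtake early8, which is the sense in which K does not substitute for horizon.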

Online generalization: minimal vs robust k-NN (k=3)

Comparison of minimal-sufficient and robust k-NN on 12 off-grid probe cells (3 scenarios × 4):

| probe | minimal: planner / gap | robust: planner / gap | improvement? |
|---|---|---|---|
| crossing (+0.25, 1.00) | early2 / -0.03 | early2 / -0.03 | same |
| crossing (+1.25, 1.00) | early2 / -0.01 | early8 / -0.11 | robust improves |
| crossing (+1.75, 1.00) | early8 / +0.00 | early8 / +0.00 | same |
| crossing (+0.50, 1.15) | early2 / -0.03 | early2 / -0.03 | same |
| slalom (+0.25, 1.00) | early4 / -0.12 | early4 / -0.12 | same |
| slalom (+1.25, 1.00) | early4 / +0.04 | early8 / +0.04 | same gap, robust heavier |
| slalom (+1.75, 1.00) | early8 / -0.28 | early16 / -0.14 | minimal slightly better |
| slalom (+0.50, 1.45) | early4 / -0.04 | early4 / -0.04 | same |
| pincer (+0.25, 1.00) | early2 / -0.04 | early2 / -0.04 | same |
| pincer (+0.75, 1.00) | early1 / +0.18 (FAIL) | early4 / -0.08 | robust fixes it |
| pincer (+1.25, 1.30) | early4 / -3.76 | early8 / -3.40 | minimal slightly better |
| pincer (+1.75, 1.30) | early8 / -0.28 | early8 / -0.28 | same |

Results

  • In robust mode, no cell has a gap above +0.05 (minimal had one cell at +0.18).
  • The only minimal miss (pincer 0.75/1.0) is fixed: minimal was pulled toward early1 because early1 succeeded only in the corner cell (0.5, 1.0), recommended early1, and failed. Robust takes the max over the 3 neighbours (early1, early4, early2), picks early4, and succeeds.
  • Trade-off: in 2 cells robust is slightly worse than minimal (slalom 1.75/1.0: gap -0.28 → -0.14; pincer 1.25/1.3: -3.76 → -3.40). Taking the max horizon is occasionally excessive, but eliminating failure cells matters more for deployment.
  • Apart from the single minimal miss, both modes keep the gap at or below +0.05 (that is, recommended is at least on par with full), which follows from the 8-step horizon being a sweet spot across scenarios.
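The counts quoted in these bullets follow from the probe table. The check below assumes the gap is the recommended planner's final_distance minus that of full (negative is better), a reading that matches the 1.85 vs 1.93 figures in the commit notes; the tuple layout is illustrative.

```python
# (scenario, speed, radius, minimal_gap, robust_gap), copied from the probe table.
PROBES = [
    ("crossing", 0.25, 1.00, -0.03, -0.03),
    ("crossing", 1.25, 1.00, -0.01, -0.11),
    ("crossing", 1.75, 1.00, 0.00, 0.00),
    ("crossing", 0.50, 1.15, -0.03, -0.03),
    ("slalom", 0.25, 1.00, -0.12, -0.12),
    ("slalom", 1.25, 1.00, 0.04, 0.04),
    ("slalom", 1.75, 1.00, -0.28, -0.14),
    ("slalom", 0.50, 1.45, -0.04, -0.04),
    ("pincer", 0.25, 1.00, -0.04, -0.04),
    ("pincer", 0.75, 1.00, 0.18, -0.08),
    ("pincer", 1.25, 1.30, -3.76, -3.40),
    ("pincer", 1.75, 1.30, -0.28, -0.28),
]
GAP_LIMIT = 0.05

minimal_over = [p for p in PROBES if p[3] > GAP_LIMIT]
robust_over = [p for p in PROBES if p[4] > GAP_LIMIT]
robust_worse = [p for p in PROBES if p[4] > p[3]]   # larger gap = worse
robust_better = [p for p in PROBES if p[4] < p[3]]

print("minimal cells over limit:", [(p[0], p[1], p[2]) for p in minimal_over])
print("robust cells over limit:", len(robust_over))
print("robust worse / better than minimal:", len(robust_worse), "/", len(robust_better))
```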

Conclusions

  • Horizon is a function of scenario topology: 2 for crossing, 4-8 for slalom/pincer, with 8 steps as the cross-scenario sweet spot.
  • K does not substitute for horizon: the compute budget should go toward getting the horizon right.
  • NN lookup plus robust k-NN selection generalizes to (speed, radius) values outside the sweep: minimal-sufficient succeeds on 11/12 probes, and robust k-NN succeeds or ties on 12/12.
  • With the core contract in place, per-scenario planner selection can be driven from data (the sweep summary), and the two policies, minimal and robust, can be swapped.

Test plan

  • cmake --build build --target benchmark_diff_mppi builds successfully.
  • python3 scripts/sweep_grad_horizon_difficulty.py --scenarios dynamic_crossing --seeds 4 completes in about 3 minutes.
  • python3 scripts/sweep_k_vs_horizon.py --scenario dynamic_pincer --speed-scale 1.5 --radius-scale 1.3 --seeds 4 completes in about 35 seconds.
  • python3 scripts/online_horizon_generalization_test.py (minimal) completes in about 40 seconds.
  • python3 scripts/online_horizon_generalization_test.py --robust (k-NN k=3) completes in about 45 seconds.
  • python3 scripts/recommend_horizon.py --summary-csv build/sweep_grad_horizon_difficulty_summary.csv prints a per-cell recommendation table.
  • python3 scripts/auto_benchmark_with_recommended_horizon.py --summary-csv build/sweep_grad_horizon_difficulty_summary.csv --verify --cells "dynamic_crossing,1.5,1.0" runs through to the re-measurement step.
  • python3 -c "from experiments.horizon_selection.difficulty_index import recommend_for_probe_robust; print('ok')" exits without error.

rsasaki0109 and others added 8 commits May 18, 2026 05:41
benchmark_diff_mppi gains --override-dyn-speed-scale and
--override-dyn-radius-scale so a single scenario (dynamic_crossing /
dynamic_slalom) can be re-used as a difficulty axis without adding new
scene definitions.

scripts/sweep_grad_horizon_difficulty.py drives the full grid
(speed_scale, radius_scale) x (mppi, diff_mppi_3_early{1,2,4,8,16},
diff_mppi_3) and aggregates per-cell success rate, final distance,
cumulative cost, collisions and avg control latency.

scripts/analyze_grad_horizon_sweep.py turns the summary CSV into a
Markdown report with success-rate / final-distance grids and a
regime classification (easy / needs e4+ / all-fail).

dynamic_pincer adds three dyn obstacles whose trajectories converge on
the corridor midpoint (~25,25), exercising the "agent meets multiple
moving obstacles in the same window" regime that single-obstacle
crossing/slalom never reaches.

analyze_grad_horizon_sweep.py now reads the scenario name from the
summary CSV (so reports for crossing/slalom/pincer label themselves
correctly) and replaces the hard-coded take-aways with data-driven
counters: per-planner success cells, share of full-horizon cells that
early2 also covers, and per-planner ranking in the all-fail regime.

core/horizon_selector_interface.py mirrors the planner-selector pattern:
- HorizonSelectionRequest: dataset, scenario, success_threshold,
  prefer_minimal, fallback_metric
- HorizonRecommendation: planner, grad_update_horizon, success_rate,
  final_distance, rationale
- HorizonSelector Protocol with recommend(rows, request)
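A minimal sketch of what such a contract could look like, using the field names listed above. The field types, the default values, the row representation (plain mappings), and the `AlwaysFullSelector` example class are assumptions for illustration and are not taken from core/horizon_selector_interface.py.

```python
from dataclasses import dataclass
from typing import Mapping, Protocol, Sequence


@dataclass(frozen=True)
class HorizonSelectionRequest:
    dataset: str
    scenario: str
    success_threshold: float = 1.0           # assumed default
    prefer_minimal: bool = True              # assumed default
    fallback_metric: str = "final_distance"  # default named in the commit notes


@dataclass(frozen=True)
class HorizonRecommendation:
    planner: str
    grad_update_horizon: int
    success_rate: float
    final_distance: float
    rationale: str


class HorizonSelector(Protocol):
    def recommend(
        self,
        rows: Sequence[Mapping[str, object]],
        request: HorizonSelectionRequest,
    ) -> HorizonRecommendation: ...


class AlwaysFullSelector:
    """Hypothetical selector: always returns the full-horizon planner."""

    def recommend(self, rows, request):
        row = next(r for r in rows if r["planner"] == "diff_mppi_3")
        return HorizonRecommendation(
            planner="diff_mppi_3",
            grad_update_horizon=0,  # 0 used as the "full" sentinel
            success_rate=float(row["success_rate"]),
            final_distance=float(row["final_distance"]),
            rationale="always-full baseline",
        )


# Illustrative rows; the numbers are made up for the example.
rows = [
    {"planner": "diff_mppi_3_early8", "success_rate": 1.0, "final_distance": 1.2},
    {"planner": "diff_mppi_3", "success_rate": 1.0, "final_distance": 1.3},
]
selector: HorizonSelector = AlwaysFullSelector()
rec = selector.recommend(rows, HorizonSelectionRequest("demo", "dynamic_crossing"))
print(rec.planner, rec.rationale)
```

Because `Protocol` uses structural typing, a selector only needs a matching `recommend` method; it does not have to inherit from `HorizonSelector`.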

experiments/horizon_selection/ provides:
- horizon_naming.parse_grad_update_horizon: pulls the integer horizon
  out of "diff_mppi_3_early8" -> 8, "diff_mppi_3" -> 0 (full sentinel).
- MinimalSufficientHorizonSelector: picks the smallest horizon meeting
  the success threshold; if none does, falls back to the requested
  metric (final_distance by default).
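The two pieces above can be sketched as follows. This is an assumption-laden illustration rather than the repository's code: the regular expression, the helper `horizon_sort_key`, the dict-based rows, and the rule that a lower fallback metric is better are all choices made for this example.

```python
import re


def parse_grad_update_horizon(planner: str) -> int:
    """'diff_mppi_3_early8' -> 8, 'diff_mppi_3' -> 0 (full-horizon sentinel)."""
    match = re.search(r"_early(\d+)$", planner)
    return int(match.group(1)) if match else 0


def horizon_sort_key(planner: str) -> float:
    """Order planners by horizon, treating the 0 sentinel as the longest."""
    horizon = parse_grad_update_horizon(planner)
    return float("inf") if horizon == 0 else float(horizon)


def pick_minimal_sufficient(rows, success_threshold=1.0, fallback_metric="final_distance"):
    """Smallest horizon meeting the threshold, else best (lowest) fallback metric."""
    sufficient = [r for r in rows if r["success_rate"] >= success_threshold]
    if sufficient:
        return min(sufficient, key=lambda r: horizon_sort_key(r["planner"]))
    return min(rows, key=lambda r: r[fallback_metric])


# Illustrative rows for two cells; the numbers are made up for the example.
easy_cell = [
    {"planner": "diff_mppi_3_early1", "success_rate": 0.0, "final_distance": 30.0},
    {"planner": "diff_mppi_3_early2", "success_rate": 1.0, "final_distance": 1.1},
    {"planner": "diff_mppi_3_early8", "success_rate": 1.0, "final_distance": 1.0},
    {"planner": "diff_mppi_3", "success_rate": 1.0, "final_distance": 1.2},
]
hard_cell = [
    {"planner": "diff_mppi_3_early2", "success_rate": 0.0, "final_distance": 30.0},
    {"planner": "diff_mppi_3_early8", "success_rate": 0.0, "final_distance": 4.6},
    {"planner": "diff_mppi_3", "success_rate": 0.0, "final_distance": 10.5},
]
print(pick_minimal_sufficient(easy_cell)["planner"])
print(pick_minimal_sufficient(hard_cell)["planner"])
```

Mapping the sentinel to infinity keeps the full-horizon planner from being mistaken for the shortest horizon when sorting.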

scripts/recommend_horizon.py is a thin driver that consumes the sweep
summary CSV (sweep_grad_horizon_difficulty.py output), reshapes each
(scenario, speed_scale, radius_scale) cell as a synthetic dataset, and
prints a Markdown recommendation table per cell.

Running it against the existing 3-scenario sweep reproduces the
sweep-level findings:
- dynamic_crossing easy cells -> early2 (smallest sufficient)
- dynamic_slalom -> early4 in easy cells, early8 in the only speed=1.5x
  cell that still succeeds, full in the all-fail tail
- dynamic_pincer -> early1/early2 in the narrow easy band, early4 in
  the speed=1.0 cells, early8 in the all-fail hard regime

scripts/auto_benchmark_with_recommended_horizon.py reads a sweep
summary CSV, asks MinimalSufficientHorizonSelector for a per-cell
recommendation, and emits a comparison table against the always-full
(diff_mppi_3) baseline pulled from the same CSV.

Default mode is "dry run": the comparison uses sweep numbers directly,
so the script needs no CUDA. --verify re-runs benchmark_diff_mppi for
the recommended planner + the always-full baseline on listed cells
(--cells "scenario,speed,radius;..."), so the recommendation can be
cross-checked end-to-end.
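One way the --cells value could be parsed is sketched below. The function `parse_cells` and its error handling are hypothetical; only the "scenario,speed,radius;..." shape comes from the text above.

```python
def parse_cells(spec: str):
    """Parse 'scenario,speed,radius;scenario,speed,radius' into tuples."""
    cells = []
    for chunk in spec.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ';'
        parts = [p.strip() for p in chunk.split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected 'scenario,speed,radius', got {chunk!r}")
        scenario, speed, radius = parts
        cells.append((scenario, float(speed), float(radius)))
    return cells


cells = parse_cells("dynamic_crossing,1.5,1.0;dynamic_pincer,1.5,1.3")
for scenario, speed, radius in cells:
    print(scenario, speed, radius)
```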

This closes the loop: the core HorizonSelector contract now drives an
end-to-end benchmark workflow, and the comparison table makes the
"recommended vs always-full" story explicit (cells where shorter
horizon matches full's quality, and cells where it actually beats
full because longer windows accumulate stale-gradient noise).

scripts/sweep_k_vs_horizon.py runs benchmark_diff_mppi across K x
planner for a fixed (scenario, speed_scale, radius_scale) cell, so
the K and horizon axes can be compared head-to-head.

scripts/analyze_k_vs_horizon.py renders Markdown grids of success
rate, final_distance and avg_control_ms over (K, planner), plus a
"substitution" table that asks per planner: does increasing K reduce
final_distance, and can the planner reach the overall best?

On the pincer 1.5/1.3 hard cell, the answer is sharply negative:
only early8 reaches the best final_d, and even at K=8192 the
shorter (early2) and longer (full) horizons remain ~4-6 units
behind. The crossing 1.5/1.0 cell is more permissive but still
shows K cannot rescue early1/early2. avg_control_ms scales with K
and is largely flat in horizon, so the compute lever is K, not
horizon.

experiments/horizon_selection/difficulty_index.py:
- load_indexed_rows: read a sweep summary CSV as
  AggregateBenchmarkRows tagged with synthetic dataset labels.
- nearest_cell: Euclidean lookup over (speed_scale, radius_scale)
  within a scenario.
- recommend_for_probe: combine the NN lookup with the
  HorizonSelector so the selector can be applied to (speed, radius)
  configurations that were not in the sweep grid.
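The nearest-cell step can be pictured with the sketch below. The signature, the tuple-based cell representation, and the grid values are assumptions for illustration; the real nearest_cell operates on the indexed rows described above.

```python
import math


def nearest_cell(cells, scenario, speed_scale, radius_scale):
    """Return the sweep cell of `scenario` closest to the probe (Euclidean)."""
    candidates = [c for c in cells if c[0] == scenario]
    if not candidates:
        raise ValueError(f"no sweep cells for scenario {scenario!r}")
    # On exact ties, min() keeps the first candidate in iteration order.
    return min(
        candidates,
        key=lambda c: math.hypot(c[1] - speed_scale, c[2] - radius_scale),
    )


# Hypothetical sweep grid: (scenario, speed_scale, radius_scale).
grid = [
    ("dynamic_pincer", 0.5, 1.0),
    ("dynamic_pincer", 1.0, 1.0),
    ("dynamic_pincer", 1.5, 1.3),
    ("dynamic_crossing", 0.5, 1.0),
]
match = nearest_cell(grid, "dynamic_pincer", speed_scale=0.6, radius_scale=1.05)
print(match)
```

Speed and radius are compared on their raw scales here; whether the real lookup normalizes the two axes is not stated in the commit notes.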

scripts/online_horizon_generalization_test.py picks 4 off-grid probe
cells per scenario (placed between sweep points so the NN distance
is nonzero but small), asks the selector for a recommendation via
the NN lookup, runs benchmark_diff_mppi with that recommendation
plus the always-full baseline, and reports the gap.

In 11 of 12 probes the recommended planner matches or beats full on
final_distance; the one miss is pincer at speed=0.75 / radius=1.0,
where the matched cell at speed=0.5 had early1 succeed and that
recommendation did not extrapolate. This is an honest signal that
minimal-sufficient is sensitive to corner cases in the easy regime
and motivates a more conservative lookup (e.g. agreement of the
k nearest cells) as future work.

recommend_for_probe_robust polls the k nearest sweep cells and
returns the recommendation with the maximum gradient-update horizon
across them (ties broken by smallest distance). This hedges
minimal-sufficient against single-cell corner cases where a tiny
horizon happens to meet the success threshold in one sweep cell but
does not generalise to probes between that cell and tougher
neighbours.
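The selection rule described here can be sketched as below. The function name `recommend_robust`, the tuple layout, the cell coordinates, and the handling of the 0 "full" sentinel are assumptions for this example; only the rule itself (maximum horizon among the k nearest cells, ties broken by smallest distance) comes from the description.

```python
import math


def recommend_robust(cell_recs, speed_scale, radius_scale, k=3):
    """Pick the max-horizon recommendation among the k nearest sweep cells.

    cell_recs: list of (speed, radius, planner, grad_update_horizon), where a
    horizon of 0 stands for the full horizon and is treated as the largest.
    """
    def distance(cell):
        return math.hypot(cell[0] - speed_scale, cell[1] - radius_scale)

    def effective_horizon(cell):
        return float("inf") if cell[3] == 0 else float(cell[3])

    neighbours = sorted(cell_recs, key=distance)[:k]
    # Largest horizon wins; among equal horizons the closer cell wins.
    return max(neighbours, key=lambda c: (effective_horizon(c), -distance(c)))


# Hypothetical per-cell minimal-sufficient recommendations around a probe.
cell_recs = [
    (0.5, 1.0, "diff_mppi_3_early1", 1),
    (1.0, 1.0, "diff_mppi_3_early4", 4),
    (0.5, 1.15, "diff_mppi_3_early2", 2),
    (1.5, 1.0, "diff_mppi_3_early8", 8),
]
choice = recommend_robust(cell_recs, speed_scale=0.75, radius_scale=1.0, k=3)
print(choice[2])
```

With k=1 the rule reduces to the plain nearest-cell lookup, so k controls how conservative the recommendation is.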

online_horizon_generalization_test.py now takes --robust / --k flags.
With --robust the pincer (0.75, 1.0) miss is fixed: minimal picked
early1 (the only cell where early1 succeeded was the corner case at
0.5/1.0) and failed at the probe; robust polls the 3 nearest cells,
gets early1/early4/early2, and picks early4 -- which succeeds with
final_d 1.85 vs the always-full baseline's 1.93.

The trade-off is a slight regression on slalom (1.75, 1.0) where
the larger horizon (early16 vs early8) costs ~0.14 in final_d. Net:
robust has no probe with gap > +0.05 (minimal had one at +0.18).
rsasaki0109 marked this pull request as ready for review May 18, 2026 01:28
rsasaki0109 merged commit 553e057 into master May 18, 2026
1 check failed