[codex] Add Diff-MPPI gradient horizon benchmarks #3
Merged
benchmark_diff_mppi gains --override-dyn-speed-scale and
--override-dyn-radius-scale so a single scenario (dynamic_crossing /
dynamic_slalom) can be re-used as a difficulty axis without adding new
scene definitions.
scripts/sweep_grad_horizon_difficulty.py drives the full grid
(speed_scale, radius_scale) x (mppi, diff_mppi_3_early{1,2,4,8,16},
diff_mppi_3) and aggregates per-cell success rate, final distance,
cumulative cost, collisions and avg control latency.
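As a rough illustration of the grid described above, the sketch below enumerates one job per (speed_scale, radius_scale, planner) combination. The planner names come from the commit message; `sweep_jobs` and the scale values are hypothetical placeholders, not the ones used by sweep_grad_horizon_difficulty.py.

```python
from itertools import product

# Planner axis as listed in the commit message.
PLANNERS = (
    ["mppi"]
    + [f"diff_mppi_3_early{n}" for n in (1, 2, 4, 8, 16)]
    + ["diff_mppi_3"]
)


def sweep_jobs(speed_scales, radius_scales, planners=PLANNERS):
    """Yield one (speed_scale, radius_scale, planner) job per grid entry."""
    for speed, radius in product(speed_scales, radius_scales):
        for planner in planners:
            yield speed, radius, planner


# Placeholder scale values (assumption), chosen only to show the grid shape.
jobs = list(sweep_jobs([0.5, 1.0], [1.0, 1.3]))
print(len(jobs))  # 4 cells x 7 planners = 28 jobs
```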
scripts/analyze_grad_horizon_sweep.py turns the summary CSV into a
Markdown report with success-rate / final-distance grids and a
regime classification (easy / needs e4+ / all-fail).
dynamic_pincer adds three dynamic obstacles whose trajectories converge on the corridor midpoint (~25, 25), exercising the "agent meets multiple moving obstacles in the same window" regime that single-obstacle crossing/slalom never reaches.

analyze_grad_horizon_sweep.py now reads the scenario name from the summary CSV (so reports for crossing/slalom/pincer label themselves correctly) and replaces the hard-coded take-aways with data-driven counters:
- per-planner success cells
- share of full-horizon cells that early2 also covers
- per-planner ranking in the all-fail regime
core/horizon_selector_interface.py mirrors the planner-selector pattern:
- HorizonSelectionRequest: dataset, scenario, success_threshold, prefer_minimal, fallback_metric
- HorizonRecommendation: planner, grad_update_horizon, success_rate, final_distance, rationale
- HorizonSelector Protocol with recommend(rows, request)

experiments/horizon_selection/ provides:
- horizon_naming.parse_grad_update_horizon: pulls the integer horizon out of the planner name: "diff_mppi_3_early8" -> 8, "diff_mppi_3" -> 0 (full sentinel).
- MinimalSufficientHorizonSelector: picks the smallest horizon meeting the success threshold; if none does, falls back to the requested metric (final_distance by default).

scripts/recommend_horizon.py is a thin driver that consumes the sweep summary CSV (sweep_grad_horizon_difficulty.py output), reshapes each (scenario, speed_scale, radius_scale) cell as a synthetic dataset, and prints a Markdown recommendation table per cell.

Running it against the existing 3-scenario sweep reproduces the sweep-level findings:
- dynamic_crossing easy cells -> early2 (smallest sufficient)
- dynamic_slalom -> early4 in easy cells, early8 in the only speed=1.5x cell that still succeeds, full in the all-fail tail
- dynamic_pincer -> early1/early2 in the narrow easy band, early4 in the speed=1.0 cells, early8 in the all-fail hard regime
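The name parsing and the minimal-sufficient rule described above can be sketched as follows. This is not the repository's code: `Row`, `horizon_rank` and the standalone `recommend` function are hypothetical stand-ins for the real request/recommendation types. It also assumes that non-Diff-MPPI planners such as "mppi" are skipped and that the full-horizon sentinel 0 ranks as the largest horizon.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

FULL_HORIZON = 0  # sentinel for "diff_mppi_3" (gradient update over the full horizon)


@dataclass(frozen=True)
class Row:
    """Hypothetical per-planner aggregate for one sweep cell."""
    planner: str
    success_rate: float
    final_distance: float


def parse_grad_update_horizon(planner: str) -> Optional[int]:
    """'diff_mppi_3_early8' -> 8, 'diff_mppi_3' -> 0, non-Diff-MPPI -> None (assumption)."""
    if not planner.startswith("diff_mppi"):
        return None
    match = re.search(r"_early(\d+)$", planner)
    return int(match.group(1)) if match else FULL_HORIZON


def horizon_rank(horizon: int) -> float:
    """Order horizons so that the full-horizon sentinel sorts last (assumption)."""
    return float("inf") if horizon == FULL_HORIZON else float(horizon)


def recommend(rows: Sequence[Row], success_threshold: float = 1.0) -> Row:
    """Smallest horizon meeting the threshold, else the best final_distance."""
    candidates = [r for r in rows if parse_grad_update_horizon(r.planner) is not None]
    if not candidates:
        raise ValueError("no Diff-MPPI rows to choose from")
    sufficient = [r for r in candidates if r.success_rate >= success_threshold]
    if sufficient:
        return min(
            sufficient,
            key=lambda r: horizon_rank(parse_grad_update_horizon(r.planner)),
        )
    # Fallback metric: final_distance, lower is better.
    return min(candidates, key=lambda r: r.final_distance)


# Illustrative numbers only, not taken from the sweep.
easy_cell = [
    Row("mppi", 0.0, 9.0),
    Row("diff_mppi_3_early1", 0.5, 3.0),
    Row("diff_mppi_3_early2", 1.0, 1.2),
    Row("diff_mppi_3", 1.0, 1.1),
]
print(recommend(easy_cell).planner)  # diff_mppi_3_early2
```

Treating the sentinel as the largest horizon keeps "prefer minimal" meaningful: the full-horizon variant is chosen only when no early variant meets the threshold or when it wins on the fallback metric.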
scripts/auto_benchmark_with_recommended_horizon.py reads a sweep summary CSV, asks MinimalSufficientHorizonSelector for a per-cell recommendation, and emits a comparison table against the always-full (diff_mppi_3) baseline pulled from the same CSV.

The default mode is a "dry run": the comparison uses sweep numbers directly, so the script needs no CUDA. --verify re-runs benchmark_diff_mppi for the recommended planner and the always-full baseline on the listed cells (--cells "scenario,speed,radius;..."), so the recommendation can be cross-checked end-to-end.

This closes the loop: the core HorizonSelector contract now drives an end-to-end benchmark workflow, and the comparison table makes the "recommended vs always-full" story explicit (cells where a shorter horizon matches full's quality, and cells where it actually beats full because longer windows accumulate stale-gradient noise).
scripts/sweep_k_vs_horizon.py runs benchmark_diff_mppi across K x planner for a fixed (scenario, speed_scale, radius_scale) cell, so the K and horizon axes can be compared head-to-head.

scripts/analyze_k_vs_horizon.py renders Markdown grids of success rate, final_distance and avg_control_ms over (K, planner), plus a "substitution" table that asks, per planner: does increasing K reduce final_distance, and can the planner reach the overall best?

On the pincer 1.5/1.3 hard cell, the answer is sharply negative: only early8 reaches the best final_distance, and even at K=8192 the shorter (early2) and longer (full) horizons remain ~4-6 units behind. The crossing 1.5/1.0 cell is more permissive but still shows that K cannot rescue early1/early2. avg_control_ms scales with K and is largely flat in horizon, so the compute lever is K, not horizon.
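The two questions behind the "substitution" table can be expressed as a small sketch. The function name, the input shape and the numbers are illustrative assumptions rather than the output format of analyze_k_vs_horizon.py. "Improves with K" is interpreted here as the largest K beating the smallest K.

```python
def substitution_table(final_distance, tolerance=1e-9):
    """Summarise whether more samples K can substitute for the right horizon.

    final_distance maps (planner, K) -> aggregated final distance (lower is better).
    """
    overall_best = min(final_distance.values())
    table = {}
    for planner in sorted({p for p, _ in final_distance}):
        # Distances for this planner, ordered by increasing K.
        by_k = sorted((k, d) for (p, k), d in final_distance.items() if p == planner)
        distances = [d for _, d in by_k]
        table[planner] = {
            "improves_with_k": distances[-1] < distances[0],
            "reaches_best": min(distances) <= overall_best + tolerance,
        }
    return table


# Illustrative values only, not measurements from the sweep.
results = {
    ("diff_mppi_3_early2", 1024): 7.0,
    ("diff_mppi_3_early2", 8192): 6.0,
    ("diff_mppi_3_early8", 1024): 2.5,
    ("diff_mppi_3_early8", 8192): 2.0,
    ("diff_mppi_3", 1024): 8.0,
    ("diff_mppi_3", 8192): 8.5,
}
for planner, answer in substitution_table(results).items():
    print(planner, answer)
```

In this toy data, early2 improves with K but never reaches the overall best, which is the pattern the commit message reports for the pincer hard cell.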
experiments/horizon_selection/difficulty_index.py:
- load_indexed_rows: read a sweep summary CSV as AggregateBenchmarkRows tagged with synthetic dataset labels.
- nearest_cell: Euclidean lookup over (speed_scale, radius_scale) within a scenario.
- recommend_for_probe: combine the NN lookup with the HorizonSelector so the selector can be applied to (speed, radius) configurations that were not in the sweep grid.

scripts/online_horizon_generalization_test.py picks 4 off-grid probe cells per scenario (placed between sweep points so the NN distance is nonzero but small), asks the selector for a recommendation via the NN lookup, runs benchmark_diff_mppi with that recommendation plus the always-full baseline, and reports the gap.

In 11 of 12 probes the recommended planner matches or beats full on final_distance; the one miss is pincer at speed=0.75 / radius=1.0, where the matched cell at speed=0.5 had early1 succeed and that recommendation did not extrapolate. This is an honest signal that minimal-sufficient is sensitive to corner cases in the easy regime and motivates a more conservative lookup (e.g. agreement of the k nearest cells) as future work.
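A minimal sketch of the nearest-cell lookup described above, assuming cells are plain (speed_scale, radius_scale) tuples within one scenario and that both axes are weighted equally. The signature is hypothetical and the real nearest_cell may differ.

```python
import math


def nearest_cell(cells, speed_scale, radius_scale):
    """Return the sweep cell closest to the probe in Euclidean distance."""
    if not cells:
        raise ValueError("no sweep cells to search")
    return min(
        cells,
        key=lambda cell: math.hypot(cell[0] - speed_scale, cell[1] - radius_scale),
    )


# Illustrative grid points; the probe sits between two of them.
grid = [(0.5, 1.0), (1.0, 1.0), (1.5, 1.3)]
print(nearest_cell(grid, 0.7, 1.0))  # (0.5, 1.0)
```

Because `min` returns the first minimal element, a probe exactly halfway between two cells resolves to whichever cell appears first in the list, so the result at midpoints depends on grid ordering.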
recommend_for_probe_robust polls the k nearest sweep cells and returns the recommendation with the maximum gradient-update horizon across them (ties broken by smallest distance). This hedges minimal-sufficient against single-cell corner cases where a tiny horizon happens to meet the success threshold in one sweep cell but does not generalise to probes between that cell and tougher neighbours.

online_horizon_generalization_test.py now takes --robust / --k flags. With --robust the pincer (0.75, 1.0) miss is fixed: minimal picked early1 (the only cell where early1 succeeded was the corner case at 0.5/1.0) and failed at the probe; robust polls the 3 nearest cells, gets early1/early4/early2, and picks early4, which succeeds with a final_distance of 1.85 vs the always-full baseline's 1.93.

The trade-off is a slight regression on slalom (1.75, 1.0), where the larger horizon (early16 vs early8) costs ~0.14 in final_distance. Net: robust has no probe with gap > +0.05 (minimal had one at +0.18).
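The robust rule can be sketched as below. This is a hedged illustration, not the repository's recommend_for_probe_robust: the function name, the input mapping from cell to recommended horizon, and the cell coordinates are assumptions. Treating the full-horizon sentinel 0 as the largest horizon is also an assumption.

```python
import math


def recommend_robust(cell_horizons, speed_scale, radius_scale, k=3):
    """Return the largest recommended horizon among the k nearest sweep cells.

    cell_horizons maps (speed_scale, radius_scale) -> recommended horizon,
    where 0 stands for the full horizon and is ranked above every early-N.
    """
    if not cell_horizons:
        raise ValueError("no sweep cells to poll")
    by_distance = sorted(
        cell_horizons.items(),
        key=lambda item: math.hypot(item[0][0] - speed_scale, item[0][1] - radius_scale),
    )
    nearest = by_distance[:k]
    # max() keeps the first maximal element, so ties go to the closest cell.
    _, horizon = max(
        nearest,
        key=lambda item: float("inf") if item[1] == 0 else item[1],
    )
    return horizon


# Illustrative layout: the three nearest cells recommend early1, early4 and early2.
cells = {(0.5, 1.0): 1, (1.0, 1.0): 4, (0.5, 1.3): 2, (1.5, 1.3): 8}
print(recommend_robust(cells, 0.75, 1.0, k=3))  # 4
```

With k=1 the same call falls back to the single nearest cell, which reproduces the minimal-sufficient lookup and its early1 pick for this probe.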
Summary
- Add `grad_update_horizon` to `PlannerVariant`, restricting Diff-MPPI's gradient steps to the first N steps only (can be overridden with `--override-grad-update-horizon`).
- Add five variants, `diff_mppi_3_early{1,2,4,8,16}`, plus a strong-feedback baseline `feedback_mppi_strong` (`feedback_mode = 10`).
- `--override-dyn-speed-scale` / `--override-dyn-radius-scale` multiply the speed and radius of dynamic obstacles at runtime.
- `dynamic_pincer`: a hard regime in which three dynamic obstacles converge on the midpoint (~25, 25) of the agent's path.
- `sweep_grad_horizon_difficulty.py` / `sweep_k_vs_horizon.py` plus two analyzers (scenario name and take-aways are data-driven).
- Add three presets to `tune_diff_mppi_time_targets.py`.
- `HorizonSelector` contract (`HorizonSelectionRequest` / `HorizonRecommendation` / `HorizonSelector` Protocol).
- `experiments/horizon_selection/`: `MinimalSufficientHorizonSelector` + `difficulty_index` (NN lookup and robust k-NN lookup over the sweep CSV).
- `recommend_horizon` / `auto_benchmark_with_recommended_horizon` / `online_horizon_generalization_test` (`--robust` flag).
Difficulty sweep: success counts across the 3 scenarios (out of 18 cells)
Hard regime (no planner reaches the goal): horizon with the best final_distance:
→ early8 dominates across all scenarios.
K × horizon: substitutability (pincer 1.5/1.3 hard cell)
Online generalization: minimal vs robust k-NN (k=3)
Comparison of minimal-sufficient and robust k-NN on 12 off-grid probe cells (3 scenarios × 4):
Results
Conclusion
Test plan
- `cmake --build build --target benchmark_diff_mppi` builds successfully
- `python3 scripts/sweep_grad_horizon_difficulty.py --scenarios dynamic_crossing --seeds 4` completes in about 3 minutes
- `python3 scripts/sweep_k_vs_horizon.py --scenario dynamic_pincer --speed-scale 1.5 --radius-scale 1.3 --seeds 4` completes in about 35 seconds
- `python3 scripts/online_horizon_generalization_test.py` (minimal) completes in about 40 seconds
- `python3 scripts/online_horizon_generalization_test.py --robust` (k-NN, k=3) completes in about 45 seconds
- `python3 scripts/recommend_horizon.py --summary-csv build/sweep_grad_horizon_difficulty_summary.csv` prints a per-cell recommendation table
- `python3 scripts/auto_benchmark_with_recommended_horizon.py --summary-csv build/sweep_grad_horizon_difficulty_summary.csv --verify --cells "dynamic_crossing,1.5,1.0"` runs through to the re-measurement step
- `python3 -c "from experiments.horizon_selection.difficulty_index import recommend_for_probe_robust; print('ok')"` exits without error