feat(bench): accuracy --repeat support + auto-on-PR bench workflow#34
Conversation
Closes the residual #31/#33 gap (single-sample timing in accuracy mode) and adds an auto-on-PR bench comment workflow so future optimizer PRs get a four-section impact report without manual workflow_dispatch. ## accuracy mode honors --repeat / --warmup `bench/runner/modes/accuracy.py`: - `_run_one_accuracy_case` now accepts `iteration: int`; previously hardcoded `iteration: 0` so multi-iter runs would have all entries collide on iteration=0. - `run_accuracy` gains `repeat: int = 1, warmup: int = 0` params and a warmup-then-repeat loop matching quick/timing mode shape. Warmup iterations are discarded; measured iterations populate the result list with the correct iteration index. - `run_accuracy_sync` plumbs `repeat`/`warmup` through. `bench/runner/cli.py`: pass `args.repeat`/`args.warmup` into the accuracy sync wrapper and into the recorded `config` dict so the JSON faithfully records what ran. `tests/bench/test_accuracy.py`: 2 new tests covering the repeat loop and warmup-iter discard. This unblocks Welch's t-test for timing on accuracy mode. With n>=3 both sides, the stats gate fires instead of the 25% noise-floor fallback that opened #31 and #33. ## bench-baseline-update.yml — --repeat 2 --warmup 0, timeout 50 min Accuracy mode at --repeat 1 took ~22 min CI. At --repeat 3 + warmup 1 that would be ~88 min, over the 40-min timeout. Picked --repeat 2 + warmup 0 (≈44 min) as the practical sweet spot: Welch's t-test at n=2 is statistically weak but better than noise-floor; timeout bumped 40 to 50 min for headroom. ## New bench-pr.yml workflow `.github/workflows/bench-pr.yml`: - Auto-runs on pull_request open/synchronize/reopen with the same path filter as bench-baseline-update.yml. - Runs `bench.run --mode accuracy --repeat 2 --warmup 0`, then `bench.compare reports/baseline.core.json reports/_head.json`. - Posts a sticky comment via `<!-- pare-bench-comment -->` signature and peter-evans/find-comment + create-or-update-comment. - Concurrency `cancel-in-progress: true` cancels older runs when new commits push, so the comment always reflects the latest commit. - Fails CI on regression in any axis (timing/compression/estimation/ errors), turning the bench delta into an actual merge gate. Docs: bench/CLAUDE.md "CI integration" section refreshed to describe all three workflows; removed the stale "accuracy mode currently ignores --repeat" note now that it's no longer true. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR closes the timing-noise residual in accuracy mode by making it honor --repeat/--warmup, retunes the baseline-update workflow to use --repeat 2 --warmup 0 within a 50-minute budget, and introduces a new bench-pr.yml workflow that posts a sticky four-section bench comment on every PR touching optimizer/estimator/bench paths and fails CI on regression.
Changes:
run_accuracynow executeswarmup+repeatiterations per case, with iteration indices recorded; CLI wires--warmup/--repeatinto both the run and the recorded config.bench-baseline-update.ymlswitches to--repeat 2 --warmup 0(timeout bumped to 50 min) so Welch's t-test gets n=2 instead of falling into the 25% noise-floor path.- New
bench-pr.ymlbuilds the PR image, runs accuracy mode, compares against the pinned baseline, posts/updates a sticky PR comment, and fails on regression.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| bench/runner/modes/accuracy.py | Add iteration arg and warmup/repeat loop in run_accuracy/run_accuracy_sync. |
| bench/runner/cli.py | Pass args.repeat/args.warmup into accuracy mode and into the recorded config. |
| tests/bench/test_accuracy.py | Add tests covering repeat=3 indices and discarded warmup iterations. |
| .github/workflows/bench-baseline-update.yml | Drop forward-compat flags to --repeat 2 --warmup 0, raise timeout to 50 min, refresh comment. |
| .github/workflows/bench-pr.yml | New workflow: auto bench compare + sticky PR comment + regression gate. |
| bench/CLAUDE.md | Document third workflow and updated baseline recipe. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| config = { | ||
| "warmup": args.warmup, | ||
| "repeat": args.repeat, | ||
| "stages": ["estimate", "optimize"], | ||
| } | ||
| iterations = run_accuracy_sync(cases, repeat=args.repeat, warmup=args.warmup) |
Pare bench — PR #34Run: 26046009297 · Commit: Bench compare: baseline.core.json → _head.jsonthreshold=10.0%, noise-floor=25.0%, noise-floor-min=5.0ms, α=0.05, cases compared=285, regressions=0, noise_floor_flags=1, improvements=0 Compare conditions
TimingPer-format summary
Per-case detail (285 cases)
Compressionreduction_threshold=3.0pp, size_threshold=5.0%
Estimationestimation_threshold=10.0pp
Auto-posted by |
Two compare-engine refinements driven by the first bench-pr.yml run on this PR (#34): the timing gate fired on noise but no real regressions, so tune the gate. ## Fix A — _STATS_MIN_ITERS 3 → 2 Welch's t-test at n=2 has 1 degree of freedom — wide CI, statistically weak. But it correctly returns p>0.05 for noisy cases (no false positive) and clears for tight ones. Better than the dumb 25% threshold in both directions. Matches accuracy mode's natural rhythm (the PR ships --repeat 2 for that mode given timeout budget). ## Fix C — noise-floor gate skips cases below absolute-ms floor Cases with baseline median below 5 ms (configurable via the new --noise-floor-min-ms flag) are not flagged by the relative gate. A 0.04 ms → 0.07 ms case is measurement quantization, not signal — the AVIF "already-optimized" early-skip path produced these at high volume in the first bench-pr run on #34 (9 of 13 flagged cases were sub-0.1 ms). Surfaced in the markdown header alongside threshold-pct and noise-floor-pct so PR comment readers see the gate definition. 3 new tests: - test_noise_floor_skipped_below_min_ms: 0.5→0.7ms case (+40%) doesn't flag - test_noise_floor_fires_at_or_above_min_ms: 100→140ms case (+40%) still flags - test_stats_engages_at_n_equals_2: 2 iters each side routes through stats path ## Empirical impact on PR #34's own bench run Re-running compare against the artifact from run 26044094317: - Before: 13 noise_floor_flags (9 sub-0.1ms AVIF + 4 BMP CPU-steal) - After: 3 noise_floor_flags (sub-ms quantization gone; CPU-steal remains) The 3 surviving flags are real shared-CI noise on single-threaded multi-100ms BMP cases — a separate axis (per-format noise floor or CPU-steal isolation) that's out of scope here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
✅ Test & Coverage ReportStatus: PASS | Tests: 1306 passed, 0 failed, 0 errors (1306 total) 🟢 Overall Coverage: 95.0%
Report generated at 2026-05-18 16:23:01 UTC from commit 334ff45 |
…bench-baseline] The previous baseline.core.json was generated at --repeat 1 (pre-PR #34's accuracy-mode --repeat support). After #34 merged, the auto bench-baseline- update workflow ran the candidate at --repeat 2 but couldn't promote because the n=1 vs n=2 mismatch routes the compare to the noise-floor path, where single-threaded BMP cases on shared CI tripped 2 false-positive flags (opened as drift issue #35). Adopt the candidate from CI run 26048342324 directly as the pinned baseline so future comparisons are n=2 vs n=2 and engage Welch's t-test properly, escaping the bootstrap cycle. [skip bench-baseline] guards against the workflow firing on this commit and re-opening another drift issue. Closes #35. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Three complementary bench-CI improvements, bundled:
accuracymode honors--repeat/--warmup— closes the residual gap from bench: baseline drift detected on main@56f9595ebcae66c4be93b3dcaab4a882b5998772 #31/bench: baseline drift detected on main@9e72ef433178240265e1af52a0826b62eb7ff997 #33 where timing comparisons in accuracy mode fell through to the 25% noise-floor path and produced false-positive drift issues on shared CI.bench-pr.ymlworkflow — auto-posts a sticky four-section bench comment (Timing / Compression / Estimation / Errors) on every PR that touches optimizer/estimator/bench paths. Fails CI on regression in any axis.Together: future optimizer PRs land with an automatic impact report; baseline-update no longer fires drift issues on pure scheduler noise; the noise-floor fallback no longer over-fires on AVIF early-skip cases.
What changed
accuracy mode
--repeat(closes the timing-noise residual)bench/runner/modes/accuracy.py:_run_one_accuracy_case(case, iteration=0)— accepts the iteration index instead of hardcoding 0.run_accuracy(cases, repeat=1, warmup=0)— new params; warmup-then-repeat loop matching quick/timing mode. Warmup results discarded.run_accuracy_syncplumbs both params.bench/runner/cli.py: passesargs.repeat/args.warmupinto the accuracy sync wrapper AND into the recordedconfigdict so the JSON faithfully reflects what ran.tests/bench/test_accuracy.py: 2 new tests covering the repeat loop and warmup-iter discard.bench-baseline-update.yml — sensible defaults
--repeat 1(effectively, before this PR) →--repeat 2 --warmup 0. Welch's t-test minimum is now n=2 (see Fix A below); running 3 iters would put us at ~66 min, over budget. Timeout bumped 40 → 50 min for headroom.New
bench-pr.yml— auto bench comment on PRspull_request(open/synchronize/reopen)bench.run+bench.compare reports/baseline.core.json reports/_head.json.peter-evans/find-comment+create-or-update-comment(matching bench.yml's pattern).<!-- pare-bench-comment -->signature keeps repeat pushes updating the same comment.cancel-in-progress: truecancels older runs when new commits push.Compare-engine noise tuning (commit 334ff45)
Fix A:
_STATS_MIN_ITERSfrom 3 → 2 incompare.py. Welch's t-test at n=2 has 1 degree of freedom — wide CI but functional. p > 0.05 on noisy cases (correct: no false positive), p < 0.05 on tight ones. Matches accuracy mode's--repeat 2.Fix C: new
--noise-floor-min-msflag (default 5.0). When a case's baseline median is below this threshold, the relative noise-floor gate is skipped — a0.04ms → 0.07ms = +75%delta is measurement quantization, not signal. The AVIF early-skip path produced 9 such false positives in this PR's first bench-pr run. Surfaced in the markdown header so reviewers see the gate definition.3 new tests covering both fixes.
Empirical impact on this PR's own bench run
The first
bench-pr.ymlrun on commite55860bflagged 13 noise_floor_flags. After applying A + C and re-comparing the same artifact: 13 → 3 noise_floor_flags. The 9 sub-millisecond AVIF false-positives are gone. The remaining 3 are real BMP single-threaded multi-100ms cases hit by shared-CI CPU steal — a separate axis worth tackling later but not blocking.Docs
bench/CLAUDE.md"CI integration" section updated to describe all three workflows; removed the stale "accuracy mode currently ignores --repeat" note.Test plan
pytest tests/bench/— 520 passed, 2 skipped (was 500 at PR feat(bench): three-axis CI gating — timing, compression, estimation #32 merge; +20 across this PR's additions)pytest tests/bench/test_compare.py -v— 36 passingruff check+black --checkcleanbench-pr.ymlrun on this PR posted the four-section comment, confirming the auto-on-PR workflow works end-to-endKnown follow-up
3 BMP single-threaded cases at 200ms-1s wall time may still flag on shared CI from real CPU-steal noise. Possible follow-ups (not in this PR): per-format noise-floor thresholds, or isolation runs for the BMP class.
Closes #33.
🤖 Generated with Claude Code