
feat(bench): accuracy --repeat support + auto-on-PR bench workflow#34

Merged: amitray007 merged 2 commits into main from bench-hardening-v2 on May 18, 2026


amitray007 (Owner) commented May 18, 2026

Summary

Three complementary bench-CI improvements, bundled:

  1. accuracy mode honors --repeat / --warmup, closing the residual gap from #31 ("bench: baseline drift detected on main@56f9595ebcae66c4be93b3dcaab4a882b5998772") and #33 ("bench: baseline drift detected on main@9e72ef433178240265e1af52a0826b62eb7ff997"), where timing comparisons in accuracy mode fell through to the 25% noise-floor path and produced false-positive drift issues on shared CI.
  2. New bench-pr.yml workflow — auto-posts a sticky four-section bench comment (Timing / Compression / Estimation / Errors) on every PR that touches optimizer/estimator/bench paths. Fails CI on regression in any axis.
  3. Compare-engine noise tuning (added after the first bench-pr run on this PR exposed a calibration gap) — drops the stats-path minimum from 3 to 2 iterations (matching accuracy mode's natural rhythm) and adds an absolute-ms floor on the noise-floor gate so sub-millisecond measurement quantization stops flagging.

Together: future optimizer PRs land with an automatic impact report; baseline-update no longer fires drift issues on pure scheduler noise; the noise-floor fallback no longer over-fires on AVIF early-skip cases.

What changed

accuracy mode --repeat (closes the timing-noise residual)

bench/runner/modes/accuracy.py:

  • _run_one_accuracy_case(case, iteration=0) — accepts the iteration index instead of hardcoding 0.
  • run_accuracy(cases, repeat=1, warmup=0) — new params; warmup-then-repeat loop matching quick/timing mode. Warmup results discarded.
  • run_accuracy_sync plumbs both params.

bench/runner/cli.py: passes args.repeat / args.warmup into the accuracy sync wrapper AND into the recorded config dict so the JSON faithfully reflects what ran.

tests/bench/test_accuracy.py: 2 new tests covering the repeat loop and warmup-iter discard.
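
The warmup-then-repeat shape described above can be sketched roughly as follows. This is an illustration only, not the code in bench/runner/modes/accuracy.py: `run_one_case`, the result dict shape, and the per-case loop order are assumptions made for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable


# Hypothetical stand-in for the real per-case runner; the name and the
# result shape are assumptions made for illustration.
async def run_one_case(case: str, iteration: int = 0) -> dict[str, Any]:
    await asyncio.sleep(0)  # placeholder for the real estimate/optimize work
    return {"case_id": case, "iteration": iteration}


async def run_accuracy_sketch(
    cases: list[str],
    repeat: int = 1,
    warmup: int = 0,
    run_one: Callable[..., Awaitable[dict[str, Any]]] = run_one_case,
) -> list[dict[str, Any]]:
    results: list[dict[str, Any]] = []
    for case in cases:
        # Warmup iterations do the same work, but their results are dropped.
        for _ in range(warmup):
            await run_one(case, iteration=0)
        # Measured iterations carry their own index so entries do not collide.
        for i in range(repeat):
            results.append(await run_one(case, iteration=i))
    return results


def run_accuracy_sync_sketch(
    cases: list[str], repeat: int = 1, warmup: int = 0
) -> list[dict[str, Any]]:
    # Synchronous wrapper that plumbs both parameters through.
    return asyncio.run(run_accuracy_sketch(cases, repeat=repeat, warmup=warmup))
```

With `repeat=3, warmup=1`, each case is executed four times but contributes three result entries, indexed 0 through 2.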

bench-baseline-update.yml — sensible defaults

Moves from an effective --repeat 1 (the behavior before this PR) to --repeat 2 --warmup 0. The Welch's t-test minimum is now n=2 (see Fix A below); running 3 iterations would take ~66 min, over budget. Timeout bumped 40 → 50 min for headroom.

New bench-pr.yml — auto bench comment on PRs

| trigger | path filter | mode | repeat | timeout |
| --- | --- | --- | --- | --- |
| pull_request (open/synchronize/reopen) | same as bench-baseline-update | accuracy | 2 | 50 min |

  • Builds pare image from PR head with GHA cache (shared across PRs and baseline-update).
  • Runs bench.run + bench.compare reports/baseline.core.json reports/_head.json.
  • Posts a sticky comment with the four-section diff via peter-evans/find-comment + create-or-update-comment (matching bench.yml's pattern).
  • Hidden <!-- pare-bench-comment --> signature keeps repeat pushes updating the same comment.
  • Concurrency cancel-in-progress: true cancels older runs when new commits push.
  • Fails CI on regression in any axis — turns the four-section gate into an actual merge requirement.
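
The sticky-comment behavior can be illustrated with a small sketch. In the workflow itself this is handled by the peter-evans actions; the functions below are hypothetical and only model the decision "update the comment that carries the hidden signature, otherwise create one". The comment dict shape (`id`, `body`) is an assumption.

```python
from typing import Optional

MARKER = "<!-- pare-bench-comment -->"  # hidden signature quoted in the PR description


def find_sticky_comment(comments: list[dict], marker: str = MARKER) -> Optional[int]:
    """Return the id of the first comment whose body contains the marker."""
    for comment in comments:
        if marker in comment.get("body", ""):
            return comment["id"]
    return None


def plan_comment_action(comments: list[dict], report_md: str) -> dict:
    """Decide whether to create a new comment or update the existing one."""
    # The marker is always re-embedded so the next push can find this comment.
    body = f"{MARKER}\n{report_md}"
    existing_id = find_sticky_comment(comments)
    if existing_id is None:
        return {"action": "create", "body": body}
    return {"action": "update", "comment_id": existing_id, "body": body}
```

Because the marker travels with every posted body, repeated pushes keep resolving to the same comment instead of appending new ones.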

Compare-engine noise tuning (commit 334ff45)

Fix A: _STATS_MIN_ITERS from 3 → 2 in compare.py. With n=2 per side, Welch's t-test has only 1 to 2 effective degrees of freedom (Welch-Satterthwaite), so the confidence interval is wide but the test is still functional: p > 0.05 on noisy cases (correct: no false positive), p < 0.05 on tight ones. Matches accuracy mode's --repeat 2.
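
To make the n=2 behavior concrete, here is a stdlib-only sketch of Welch's statistic. It is not the code in compare.py (which may well use a statistics library); `welch_t` and `p_upper_bound_df1` are hypothetical helpers. With two samples of size 2, the Welch-Satterthwaite degrees of freedom fall between 1 and 2, so evaluating the p-value at 1 degree of freedom (a Cauchy distribution) gives a conservative upper bound.

```python
import math
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for b vs a."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    if va + vb == 0:
        raise ValueError("both samples have zero variance")
    t = (mean(b) - mean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


def p_upper_bound_df1(t: float) -> float:
    """Two-sided p-value of t under a t-distribution with 1 df (Cauchy).

    Lower df means heavier tails, so this is never smaller than the
    p-value at the true Welch df, which is at least 1.
    """
    return 1.0 - (2.0 / math.pi) * math.atan(abs(t))


# Tight samples (ms): a 40 ms shift with about 1 ms of spread clears alpha=0.05.
t_tight, df_tight = welch_t([100.0, 102.0], [140.0, 142.0])
# Noisy samples (ms): a 10 ms shift buried in 40 ms of spread does not.
t_noisy, df_noisy = welch_t([100.0, 140.0], [110.0, 150.0])
```

Even with the conservative bound, the tight pair stays below 0.05 while the noisy pair stays well above it.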

Fix C: new --noise-floor-min-ms flag (default 5.0). When a case's baseline median is below this threshold, the relative noise-floor gate is skipped — a 0.04ms → 0.07ms = +75% delta is measurement quantization, not signal. The AVIF early-skip path produced 9 such false positives in this PR's first bench-pr run. Surfaced in the markdown header so reviewers see the gate definition.
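
A minimal sketch of the gate with the absolute-ms floor, assuming the defaults quoted in this PR (25% relative noise floor, 5.0 ms minimum). The function name and signature are hypothetical, and the choice to flag only slowdowns is an assumption, not something the PR text states.

```python
def noise_floor_flag(
    baseline_ms: float,
    head_ms: float,
    noise_floor_pct: float = 25.0,
    noise_floor_min_ms: float = 5.0,
) -> bool:
    """Return True when the relative noise-floor gate should flag a case."""
    # Below the absolute floor, relative deltas are dominated by timer
    # quantization, so the gate is skipped entirely.
    if baseline_ms <= 0 or baseline_ms < noise_floor_min_ms:
        return False
    delta_pct = (head_ms - baseline_ms) / baseline_ms * 100.0
    # Assumption: only slowdowns beyond the threshold are flagged.
    return delta_pct > noise_floor_pct
```

Under these assumptions, `noise_floor_flag(0.04, 0.07)` is skipped despite the +75% delta, while `noise_floor_flag(100.0, 140.0)` still flags.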

3 new tests covering both fixes.

Empirical impact on this PR's own bench run

The first bench-pr.yml run on commit e55860b flagged 13 noise_floor_flags. After applying A + C and re-comparing the same artifact: 13 → 3 noise_floor_flags. The 9 sub-millisecond AVIF false-positives are gone. The remaining 3 are real BMP single-threaded multi-100ms cases hit by shared-CI CPU steal — a separate axis worth tackling later but not blocking.

Docs

bench/CLAUDE.md "CI integration" section updated to describe all three workflows; removed the stale "accuracy mode currently ignores --repeat" note.

Test plan

  • pytest tests/bench/ — 520 passed, 2 skipped (was 500 at PR feat(bench): three-axis CI gating — timing, compression, estimation #32 merge; +20 across this PR's additions)
  • pytest tests/bench/test_compare.py -v — 36 passing
  • ruff check + black --check clean
  • First bench-pr.yml run on this PR posted the four-section comment, confirming the auto-on-PR workflow works end-to-end
  • Re-compare with the new gate: 13 → 3 noise_floor_flags

Known follow-up

3 BMP single-threaded cases at 200ms-1s wall time may still flag on shared CI from real CPU-steal noise. Possible follow-ups (not in this PR): per-format noise-floor thresholds, or isolation runs for the BMP class.

Closes #33.

🤖 Generated with Claude Code

Closes the residual #31/#33 gap (single-sample timing in accuracy mode)
and adds an auto-on-PR bench comment workflow so future optimizer PRs
get a four-section impact report without manual workflow_dispatch.

## accuracy mode honors --repeat / --warmup

`bench/runner/modes/accuracy.py`:
- `_run_one_accuracy_case` now accepts `iteration: int`; previously
  hardcoded `iteration: 0` so multi-iter runs would have all entries
  collide on iteration=0.
- `run_accuracy` gains `repeat: int = 1, warmup: int = 0` params and a
  warmup-then-repeat loop matching quick/timing mode shape. Warmup
  iterations are discarded; measured iterations populate the result
  list with the correct iteration index.
- `run_accuracy_sync` plumbs `repeat`/`warmup` through.

`bench/runner/cli.py`: pass `args.repeat`/`args.warmup` into the
accuracy sync wrapper and into the recorded `config` dict so the JSON
faithfully records what ran.

`tests/bench/test_accuracy.py`: 2 new tests covering the repeat loop
and warmup-iter discard.

This unblocks Welch's t-test for timing on accuracy mode. With n>=3
both sides, the stats gate fires instead of the 25% noise-floor
fallback that opened #31 and #33.

## bench-baseline-update.yml — --repeat 2 --warmup 0, timeout 50 min

Accuracy mode at --repeat 1 took ~22 min CI. At --repeat 3 + warmup 1
that would be ~88 min, over the 40-min timeout. Picked --repeat 2 +
warmup 0 (≈44 min) as the practical sweet spot: Welch's t-test at n=2
is statistically weak but better than noise-floor; timeout bumped 40
to 50 min for headroom.
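
The budget arithmetic above can be checked with a short sketch. It assumes every iteration, warmup included, costs the ~22 min measured at --repeat 1; the helper names are hypothetical.

```python
def estimated_minutes(
    repeat: int, warmup: int, minutes_per_iteration: float = 22.0
) -> float:
    """Estimate CI wall time, assuming warmup and measured iterations cost the same."""
    return (repeat + warmup) * minutes_per_iteration


def fits(repeat: int, warmup: int, timeout_min: float) -> bool:
    """Return True when the estimate is within the workflow timeout."""
    return estimated_minutes(repeat, warmup) <= timeout_min
```

Under that assumption, `--repeat 2 --warmup 0` fits the new 50-min timeout but not the old 40-min one, and `--repeat 3` exceeds both even without warmup.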

## New bench-pr.yml workflow

`.github/workflows/bench-pr.yml`:
- Auto-runs on pull_request open/synchronize/reopen with the same
  path filter as bench-baseline-update.yml.
- Runs `bench.run --mode accuracy --repeat 2 --warmup 0`, then
  `bench.compare reports/baseline.core.json reports/_head.json`.
- Posts a sticky comment via `<!-- pare-bench-comment -->` signature
  and peter-evans/find-comment + create-or-update-comment.
- Concurrency `cancel-in-progress: true` cancels older runs when new
  commits push, so the comment always reflects the latest commit.
- Fails CI on regression in any axis (timing/compression/estimation/
  errors), turning the bench delta into an actual merge gate.
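
As a rough illustration of the merge gate, the sketch below turns per-axis regression counts into a process exit code. The function and the input shape are hypothetical; how noise-floor flags feed into the real gate is not shown here.

```python
import sys

AXES = ("timing", "compression", "estimation", "errors")


def gate_exit_code(regressions_by_axis: dict[str, int]) -> int:
    """Return 1 if any axis reports at least one regression, else 0."""
    failing = [axis for axis in AXES if regressions_by_axis.get(axis, 0) > 0]
    for axis in failing:
        # Name the failing axis in the job log before failing the step.
        print(f"regression on axis: {axis}", file=sys.stderr)
    return 1 if failing else 0
```

A workflow step could end with `sys.exit(gate_exit_code(counts))`, so a non-zero count on any axis fails the job.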

Docs: bench/CLAUDE.md "CI integration" section refreshed to describe
all three workflows; removed the stale "accuracy mode currently
ignores --repeat" note now that it's no longer true.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 15:44
@amitray007 amitray007 self-assigned this May 18, 2026

Copilot AI left a comment


Pull request overview

This PR closes the timing-noise residual in accuracy mode by making it honor --repeat/--warmup, retunes the baseline-update workflow to use --repeat 2 --warmup 0 within a 50-minute budget, and introduces a new bench-pr.yml workflow that posts a sticky four-section bench comment on every PR touching optimizer/estimator/bench paths and fails CI on regression.

Changes:

  • run_accuracy now executes warmup + repeat iterations per case, with iteration indices recorded; CLI wires --warmup/--repeat into both the run and the recorded config.
  • bench-baseline-update.yml switches to --repeat 2 --warmup 0 (timeout bumped to 50 min) so Welch's t-test gets n=2 instead of falling into the 25% noise-floor path.
  • New bench-pr.yml builds the PR image, runs accuracy mode, compares against the pinned baseline, posts/updates a sticky PR comment, and fails on regression.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| bench/runner/modes/accuracy.py | Add iteration arg and warmup/repeat loop in run_accuracy/run_accuracy_sync. |
| bench/runner/cli.py | Pass args.repeat/args.warmup into accuracy mode and into the recorded config. |
| tests/bench/test_accuracy.py | Add tests covering repeat=3 indices and discarded warmup iterations. |
| .github/workflows/bench-baseline-update.yml | Drop forward-compat flags to --repeat 2 --warmup 0, raise timeout to 50 min, refresh comment. |
| .github/workflows/bench-pr.yml | New workflow: auto bench compare + sticky PR comment + regression gate. |
| bench/CLAUDE.md | Document third workflow and updated baseline recipe. |


Comment thread bench/runner/cli.py
Comment on lines +149 to +154

    config = {
        "warmup": args.warmup,
        "repeat": args.repeat,
        "stages": ["estimate", "optimize"],
    }
    iterations = run_accuracy_sync(cases, repeat=args.repeat, warmup=args.warmup)

github-actions (Bot) commented May 18, 2026

Pare bench — PR #34

Run: 26046009297 · Commit: 334ff451c447c14ad47952e5daceb769baedc5db · Mode: accuracy / core · Threshold: ±10%

Bench compare: baseline.core.json → _head.json

threshold=10.0%, noise-floor=25.0%, noise-floor-min=5.0ms, α=0.05, cases compared=285, regressions=0, noise_floor_flags=1, improvements=0

Compare conditions

| | baseline | head |
| --- | --- | --- |
| file | baseline.core.json | _head.json |
| mode | accuracy | accuracy |
| isolate | False | False |
| platform | linux | linux |
| cpu_count | 4 | 4 |

Timing

Per-format summary

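| Format | Cases | Median Δ% | Worst Δ% | Regressions | Improvements | Status |
| --- | --- | --- | --- | --- | --- | --- |
| webp | 36 | -0.2% | +33.0% | 1 | 0 | |
| avif | 30 | +0.8% | +52.5% | 0 | 0 | ~ |
| bmp | 18 | -4.1% | +37.4% | 0 | 0 | ~ |
| jpeg | 48 | -0.5% | +12.4% | 0 | 0 | ~ |
| heic | 21 | -1.0% | +11.5% | 0 | 0 | ~ |
| apng | 15 | -2.3% | +9.6% | 0 | 0 | ~ |
| tiff | 18 | -0.3% | +4.7% | 0 | 0 | ~ |
| svg | 18 | -0.8% | +3.8% | 0 | 0 | ~ |
| gif | 15 | -0.8% | +2.8% | 0 | 0 | ~ |
| png | 60 | -0.5% | +2.4% | 0 | 0 | ~ |
| svgz | 6 | -0.8% | +2.3% | 0 | 0 | ~ |
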
Per-case detail (285 cases)
case_id baseline head Δ% p d label
photo_noise_small_avif.avif@low 0.0ms 0.1ms +52.5% 1.000 +0.00 ~
photo_perlin_tiny_avif.avif@high 4.5ms 6.3ms +41.4% 1.000 +0.00 ~
photo_perlin_tiny_bmp.bmp@low 0.5ms 0.7ms +37.4% 1.000 +0.00 ~
path_text_on_flat_large_webp.webp@high 3044.4ms 4048.7ms +33.0% 1.000 +0.00 ⚠ noise-floor
graphic_geometric_tiny_avif.avif@low 0.0ms 0.1ms +27.2% 1.000 +0.00 ~
photo_noise_large_avif.avif@low 0.1ms 0.1ms +26.7% 1.000 +0.00 ~
path_text_on_flat_large_webp.webp@low 4028.7ms 3062.1ms -24.0% 1.000 +0.00 ~
photo_perlin_small_avif.avif@medium 0.0ms 0.1ms +23.7% 1.000 +0.00 ~
photo_perlin_tiny_avif.avif@low 0.1ms 0.0ms -18.9% 1.000 +0.00 ~
animated_translation_medium_apng.apng@high 100.0ms 81.9ms -18.1% 1.000 +0.00 ~
photo_perlin_tiny_png.png@high 14.5ms 12.0ms -17.5% 1.000 +0.00 ~
path_text_on_flat_medium_jpeg.jpeg@low 60.3ms 50.0ms -17.0% 1.000 +0.00 ~
photo_perlin_xlarge_bmp.bmp@low 12.7ms 10.6ms -16.2% 1.000 +0.00 ~
photo_perlin_xlarge_bmp.bmp@high 1012.6ms 860.8ms -15.0% 1.000 +0.00 ~
graphic_geometric_medium_avif.avif@medium 0.0ms 0.0ms +12.6% 1.000 +0.00 ~
photo_perlin_tiny_webp.webp@high 1.3ms 1.1ms -12.6% 1.000 +0.00 ~
photo_perlin_small_avif.avif@low 0.0ms 0.0ms +12.4% 1.000 +0.00 ~
graphic_geometric_tiny_jpeg.jpeg@low 2.1ms 2.4ms +12.4% 1.000 +0.00 ~
deep_color_10bit_small_heic.heic@medium 47.1ms 52.6ms +11.5% 1.000 +0.00 ~
photo_perlin_tiny_webp.webp@low 1.2ms 1.3ms +11.2% 1.000 +0.00 ~
photo_noise_medium_avif.avif@low 0.0ms 0.0ms +10.5% 1.000 +0.00 ~
photo_perlin_large_bmp.bmp@low 3.7ms 3.3ms -10.4% 1.000 +0.00 ~
photo_perlin_xlarge_bmp.bmp@medium 792.3ms 711.3ms -10.2% 1.000 +0.00 ~
animated_translation_tiny_apng.apng@high 9.3ms 10.2ms +9.6% 1.000 +0.00 ~
photo_noise_small_heic.heic@medium 63.3ms 69.2ms +9.4% 1.000 +0.00 ~
graphic_geometric_tiny_avif.avif@high 6.1ms 6.7ms +9.3% 1.000 +0.00 ~
graphic_palette_tiny_tiff.tiff@low 3.0ms 2.7ms -9.1% 1.000 +0.00 ~
animated_translation_small_apng.apng@medium 30.4ms 27.7ms -9.0% 1.000 +0.00 ~
photo_perlin_small_bmp.bmp@low 0.6ms 0.6ms -8.8% 1.000 +0.00 ~
graphic_geometric_medium_avif.avif@low 0.1ms 0.0ms -8.7% 1.000 +0.00 ~
photo_perlin_small_tiff.tiff@low 4.0ms 3.7ms -8.7% 1.000 +0.00 ~
photo_perlin_small_heic.heic@low 47.4ms 43.5ms -8.3% 1.000 +0.00 ~
graphic_geometric_tiny_png.png@high 4.4ms 4.1ms -8.2% 1.000 +0.00 ~
fat_avif_noise_xlarge.avif@low 0.1ms 0.1ms +7.7% 1.000 +0.00 ~
photo_perlin_tiny_jpeg.jpeg@high 2.5ms 2.3ms -7.6% 1.000 +0.00 ~
photo_noise_large_jpeg.jpeg@high 370.7ms 396.3ms +6.9% 1.000 +0.00 ~
animated_translation_small_apng.apng@high 28.5ms 26.5ms -6.8% 1.000 +0.00 ~
photo_noise_medium_avif.avif@high 108.7ms 116.1ms +6.8% 1.000 +0.00 ~
graphic_geometric_tiny_jpeg.jpeg@high 2.5ms 2.3ms -6.7% 1.000 +0.00 ~
fat_bmp_noise_xlarge.bmp@low 58.5ms 54.7ms -6.5% 1.000 +0.00 ~
graphic_palette_tiny_tiff.tiff@high 3.3ms 3.1ms -6.4% 1.000 +0.00 ~
photo_perlin_tiny_heic.heic@low 6.7ms 7.1ms +6.4% 1.000 +0.00 ~
photo_perlin_tiny_heic.heic@high 10.2ms 9.6ms -6.2% 1.000 +0.00 ~
photo_perlin_large_bmp.bmp@medium 224.7ms 211.5ms -5.9% 1.000 +0.00 ~
photo_noise_xlarge_avif.avif@low 0.1ms 0.1ms -5.7% 1.000 +0.00 ~
photo_noise_small_avif.avif@high 44.7ms 42.3ms -5.5% 1.000 +0.00 ~
animated_translation_medium_apng.apng@medium 101.2ms 95.9ms -5.2% 1.000 +0.00 ~
photo_perlin_small_heic.heic@medium 79.9ms 83.9ms +5.0% 1.000 +0.00 ~
photo_perlin_medium_bmp.bmp@low 1.1ms 1.1ms -4.8% 1.000 +0.00 ~
photo_noise_medium_webp.webp@high 157.5ms 150.0ms -4.8% 1.000 +0.00 ~
animated_redraw_tiny_gif.gif@high 4.7ms 4.4ms -4.7% 1.000 +0.00 ~
photo_perlin_xlarge_tiff.tiff@low 435.1ms 455.7ms +4.7% 1.000 +0.00 ~
deep_color_10bit_small_avif.avif@low 0.0ms 0.0ms +4.7% 1.000 +0.00 ~
photo_perlin_small_heic.heic@high 55.9ms 53.3ms -4.7% 1.000 +0.00 ~
photo_noise_small_heic.heic@low 28.1ms 29.4ms +4.7% 1.000 +0.00 ~
photo_perlin_small_bmp.bmp@medium 8.3ms 7.9ms -4.5% 1.000 +0.00 ~
photo_perlin_medium_heic.heic@high 261.3ms 250.6ms -4.1% 1.000 +0.00 ~
animated_translation_medium_apng.apng@low 47.0ms 45.1ms -4.1% 1.000 +0.00 ~
photo_perlin_medium_tiff.tiff@high 29.2ms 28.0ms -3.9% 1.000 +0.00 ~
vector_geometric_tiny_svg.svg@low 14.9ms 15.5ms +3.8% 1.000 +0.00 ~
fat_bmp_noise_xlarge.bmp@high 7072.5ms 6807.7ms -3.7% 1.000 +0.00 ~
path_thin_gradient_medium_webp.webp@medium 3685.6ms 3548.5ms -3.7% 1.000 +0.00 ~
photo_perlin_tiny_jpeg.jpeg@medium 2.4ms 2.3ms -3.7% 1.000 +0.00 ~
graphic_geometric_tiny_png.png@medium 4.2ms 4.0ms -3.7% 1.000 +0.00 ~
transparent_overlay_large_webp.webp@medium 924.6ms 958.1ms +3.6% 1.000 +0.00 ~
transparent_overlay_medium_png.png@low 179.7ms 173.3ms -3.6% 1.000 +0.00 ~
animated_translation_small_apng.apng@low 14.1ms 13.6ms -3.6% 1.000 +0.00 ~
animated_redraw_tiny_gif.gif@low 2.1ms 2.0ms -3.5% 1.000 +0.00 ~
photo_perlin_medium_tiff.tiff@medium 28.7ms 27.7ms -3.5% 1.000 +0.00 ~
path_thin_gradient_medium_webp.webp@low 3453.8ms 3570.5ms +3.4% 1.000 +0.00 ~
photo_perlin_tiny_heic.heic@medium 13.3ms 12.9ms -3.4% 1.000 +0.00 ~
path_thin_gradient_medium_jpeg.jpeg@medium 159.5ms 164.5ms +3.2% 1.000 +0.00 ~
path_thin_gradient_small_jpeg.jpeg@high 17.1ms 16.6ms -3.0% 1.000 +0.00 ~
photo_perlin_medium_bmp.bmp@medium 73.4ms 71.3ms -2.9% 1.000 +0.00 ~
path_thin_gradient_small_jpeg.jpeg@low 17.3ms 16.8ms -2.9% 1.000 +0.00 ~
animated_redraw_tiny_gif.gif@medium 5.1ms 5.2ms +2.8% 1.000 +0.00 ~
graphic_geometric_large_jpeg.jpeg@high 1165.6ms 1132.8ms -2.8% 1.000 +0.00 ~
photo_perlin_large_heic.heic@medium 3247.3ms 3156.9ms -2.8% 1.000 +0.00 ~
vector_geometric_tiny_svg.svg@medium 15.4ms 15.0ms -2.7% 1.000 +0.00 ~
transparent_overlay_small_png.png@low 26.2ms 25.5ms -2.7% 1.000 +0.00 ~
text_screenshot_small_png.png@high 49.5ms 48.1ms -2.7% 1.000 +0.00 ~
animated_translation_tiny_apng.apng@medium 10.3ms 10.0ms -2.5% 1.000 +0.00 ~
fat_bmp_noise_xlarge.bmp@medium 5887.7ms 5742.0ms -2.5% 1.000 +0.00 ~
photo_noise_small_webp.webp@medium 9.9ms 9.6ms -2.4% 1.000 +0.00 ~
photo_perlin_small_png.png@high 83.1ms 81.1ms -2.4% 1.000 +0.00 ~
vector_with_script_medium_svg.svg@high 232.8ms 227.3ms -2.4% 1.000 +0.00 ~
graphic_geometric_medium_png.png@low 2932.9ms 2863.4ms -2.4% 1.000 +0.00 ~
photo_noise_small_png.png@high 42.1ms 43.1ms +2.4% 1.000 +0.00 ~
path_text_on_flat_medium_jpeg.jpeg@high 51.2ms 50.0ms -2.3% 1.000 +0.00 ~
path_thin_gradient_small_jpeg.jpeg@medium 17.2ms 16.8ms -2.3% 1.000 +0.00 ~
transparent_overlay_large_webp.webp@low 983.1ms 960.2ms -2.3% 1.000 +0.00 ~
animated_translation_large_apng.apng@low 758.4ms 740.8ms -2.3% 1.000 +0.00 ~
vector_geometric_tiny_svgz.svgz@low 22.4ms 22.9ms +2.3% 1.000 +0.00 ~
photo_perlin_tiny_avif.avif@medium 4.2ms 4.3ms +2.3% 1.000 +0.00 ~
photo_perlin_tiny_bmp.bmp@high 6.9ms 7.1ms +2.3% 1.000 +0.00 ~
vector_geometric_medium_svg.svg@low 232.8ms 227.7ms -2.2% 1.000 +0.00 ~
photo_noise_xlarge_jpeg.jpeg@high 1382.6ms 1411.9ms +2.1% 1.000 +0.00 ~
path_thin_gradient_large_jpeg.jpeg@high 1370.3ms 1341.8ms -2.1% 1.000 +0.00 ~
photo_perlin_tiny_bmp.bmp@medium 6.4ms 6.5ms +2.0% 1.000 +0.00 ~
graphic_geometric_medium_jpeg.jpeg@medium 66.2ms 67.6ms +2.0% 1.000 +0.00 ~
photo_perlin_tiny_png.png@medium 22.6ms 23.0ms +2.0% 1.000 +0.00 ~
photo_noise_small_png.png@medium 51.8ms 52.8ms +2.0% 1.000 +0.00 ~
animated_translation_tiny_apng.apng@low 5.2ms 5.1ms -2.0% 1.000 +0.00 ~
deep_color_10bit_small_avif.avif@medium 0.0ms 0.0ms +2.0% 1.000 +0.00 ~
photo_perlin_medium_heic.heic@low 270.8ms 265.8ms -1.8% 1.000 +0.00 ~
path_text_on_flat_small_jpeg.jpeg@high 21.4ms 21.0ms -1.8% 1.000 +0.00 ~
text_screenshot_large_jpeg.jpeg@low 444.3ms 452.3ms +1.8% 1.000 +0.00 ~
vector_geometric_large_canvas_svg.svg@low 231.5ms 227.3ms -1.8% 1.000 +0.00 ~
photo_perlin_large_png.png@medium 2928.9ms 2876.5ms -1.8% 1.000 +0.00 ~
text_screenshot_small_png.png@medium 48.8ms 47.9ms -1.8% 1.000 +0.00 ~
path_text_on_flat_small_jpeg.jpeg@low 21.2ms 20.9ms -1.8% 1.000 +0.00 ~
vector_with_script_medium_svg.svg@low 229.4ms 225.3ms -1.7% 1.000 +0.00 ~
text_screenshot_small_png.png@low 62.5ms 61.4ms -1.7% 1.000 +0.00 ~
photo_perlin_large_heic.heic@low 2223.0ms 2185.0ms -1.7% 1.000 +0.00 ~
graphic_geometric_tiny_jpeg.jpeg@medium 2.3ms 2.3ms -1.6% 1.000 +0.00 ~
vector_geometric_large_canvas_svg.svg@medium 230.2ms 226.4ms -1.6% 1.000 +0.00 ~
vector_with_script_small_svgz.svgz@low 240.3ms 236.5ms -1.6% 1.000 +0.00 ~
photo_perlin_medium_png.png@high 535.3ms 527.1ms -1.5% 1.000 +0.00 ~
graphic_geometric_small_webp.webp@high 41.2ms 40.6ms -1.5% 1.000 +0.00 ~
photo_perlin_medium_bmp.bmp@high 81.6ms 80.5ms -1.4% 1.000 +0.00 ~
animated_redraw_small_gif.gif@low 9.1ms 9.0ms -1.4% 1.000 +0.00 ~
transparent_overlay_large_webp.webp@high 1326.8ms 1345.5ms +1.4% 1.000 +0.00 ~
graphic_geometric_medium_jpeg.jpeg@high 67.3ms 66.4ms -1.4% 1.000 +0.00 ~
photo_noise_xlarge_png.png@high 1823.9ms 1798.4ms -1.4% 1.000 +0.00 ~
path_text_on_flat_medium_jpeg.jpeg@medium 51.1ms 50.4ms -1.4% 1.000 +0.00 ~
animated_redraw_small_gif.gif@medium 38.4ms 37.9ms -1.3% 1.000 +0.00 ~
photo_perlin_small_png.png@low 88.5ms 87.3ms -1.3% 1.000 +0.00 ~
photo_perlin_small_jpeg.jpeg@medium 34.4ms 33.9ms -1.3% 1.000 +0.00 ~
photo_perlin_large_heic.heic@high 1922.9ms 1898.6ms -1.3% 1.000 +0.00 ~
path_text_on_flat_medium_png.png@low 1700.4ms 1678.9ms -1.3% 1.000 +0.00 ~
photo_perlin_small_bmp.bmp@high 8.9ms 8.7ms -1.2% 1.000 +0.00 ~
path_thin_gradient_medium_jpeg.jpeg@low 162.0ms 164.0ms +1.2% 1.000 +0.00 ~
transparent_overlay_medium_webp.webp@high 185.7ms 183.4ms -1.2% 1.000 +0.00 ~
graphic_geometric_medium_jpeg.jpeg@low 66.7ms 65.9ms -1.2% 1.000 +0.00 ~
graphic_palette_medium_gif.gif@low 21.3ms 21.0ms -1.2% 1.000 +0.00 ~
graphic_palette_tiny_tiff.tiff@medium 3.2ms 3.3ms +1.2% 1.000 +0.00 ~
graphic_geometric_small_webp.webp@medium 40.8ms 41.3ms +1.2% 1.000 +0.00 ~
vector_geometric_medium_svg.svg@high 232.7ms 229.9ms -1.2% 1.000 +0.00 ~
path_text_on_flat_large_jpeg.jpeg@high 610.4ms 603.2ms -1.2% 1.000 +0.00 ~
graphic_geometric_large_jpeg.jpeg@medium 1146.1ms 1132.6ms -1.2% 1.000 +0.00 ~
photo_noise_large_webp.webp@high 844.2ms 834.4ms -1.2% 1.000 +0.00 ~
photo_perlin_xlarge_tiff.tiff@high 450.6ms 445.5ms -1.1% 1.000 +0.00 ~
fat_png_noise_xlarge.png@low 3975.2ms 3930.7ms -1.1% 1.000 +0.00 ~
deep_color_10bit_small_heic.heic@low 22.8ms 23.0ms +1.1% 1.000 +0.00 ~
transparent_overlay_small_png.png@medium 24.1ms 24.3ms +1.0% 1.000 +0.00 ~
graphic_geometric_large_jpeg.jpeg@low 1142.9ms 1131.1ms -1.0% 1.000 +0.00 ~
photo_perlin_tiny_jpeg.jpeg@low 2.4ms 2.4ms +1.0% 1.000 +0.00 ~
photo_perlin_xlarge_heic.heic@high 8080.2ms 7999.4ms -1.0% 1.000 +0.00 ~
photo_perlin_medium_heic.heic@medium 424.2ms 420.0ms -1.0% 1.000 +0.00 ~
graphic_geometric_small_webp.webp@low 40.8ms 40.4ms -1.0% 1.000 +0.00 ~
photo_perlin_medium_tiff.tiff@low 26.6ms 26.8ms +1.0% 1.000 +0.00 ~
vector_geometric_tiny_svgz.svgz@medium 22.5ms 22.3ms -1.0% 1.000 +0.00 ~
text_screenshot_large_png.png@high 7827.6ms 7753.3ms -0.9% 1.000 +0.00 ~
path_text_on_flat_large_png.png@low 6652.1ms 6590.1ms -0.9% 1.000 +0.00 ~
photo_perlin_tiny_webp.webp@medium 1.2ms 1.2ms +0.9% 1.000 +0.00 ~
photo_perlin_small_jpeg.jpeg@low 34.4ms 34.1ms -0.9% 1.000 +0.00 ~
vector_with_script_small_svg.svg@low 22.4ms 22.2ms -0.9% 1.000 +0.00 ~
photo_noise_xlarge_jpeg.jpeg@medium 1374.5ms 1387.1ms +0.9% 1.000 +0.00 ~
photo_perlin_large_tiff.tiff@high 186.0ms 184.3ms -0.9% 1.000 +0.00 ~
transparent_overlay_medium_png.png@high 207.1ms 205.2ms -0.9% 1.000 +0.00 ~
photo_noise_small_avif.avif@medium 45.5ms 45.9ms +0.9% 1.000 +0.00 ~
vector_with_script_small_svgz.svgz@medium 239.9ms 237.9ms -0.9% 1.000 +0.00 ~
text_screenshot_medium_jpeg.jpeg@medium 48.7ms 48.3ms -0.8% 1.000 +0.00 ~
graphic_geometric_medium_avif.avif@high 2744.4ms 2721.3ms -0.8% 1.000 +0.00 ~
graphic_palette_medium_gif.gif@high 59.3ms 58.8ms -0.8% 1.000 +0.00 ~
photo_noise_large_webp.webp@medium 874.7ms 867.4ms -0.8% 1.000 +0.00 ~
path_thin_gradient_large_png.png@high 16990.3ms 16849.7ms -0.8% 1.000 +0.00 ~
text_screenshot_large_png.png@low 7946.2ms 7880.8ms -0.8% 1.000 +0.00 ~
text_screenshot_medium_png.png@medium 650.5ms 655.8ms +0.8% 1.000 +0.00 ~
vector_geometric_small_svg.svg@high 22.3ms 22.1ms -0.8% 1.000 +0.00 ~
vector_with_script_small_svg.svg@high 22.3ms 22.2ms -0.8% 1.000 +0.00 ~
animated_redraw_small_gif.gif@high 39.1ms 38.8ms -0.8% 1.000 +0.00 ~
graphic_palette_medium_gif.gif@medium 59.0ms 58.5ms -0.8% 1.000 +0.00 ~
vector_geometric_tiny_svgz.svgz@high 22.6ms 22.5ms -0.8% 1.000 +0.00 ~
photo_perlin_xlarge_heic.heic@medium 13946.4ms 13833.9ms -0.8% 1.000 +0.00 ~
photo_perlin_small_jpeg.jpeg@high 34.0ms 33.7ms -0.8% 1.000 +0.00 ~
photo_noise_xlarge_png.png@low 1766.4ms 1752.6ms -0.8% 1.000 +0.00 ~
photo_noise_xlarge_webp.webp@high 1411.0ms 1400.1ms -0.8% 1.000 +0.00 ~
photo_noise_small_png.png@low 51.7ms 52.1ms +0.8% 1.000 +0.00 ~
transparent_overlay_medium_png.png@medium 168.7ms 167.4ms -0.8% 1.000 +0.00 ~
fat_png_noise_xlarge.png@medium 4010.2ms 3979.6ms -0.8% 1.000 +0.00 ~
photo_noise_medium_webp.webp@medium 155.8ms 157.0ms +0.8% 1.000 +0.00 ~
transparent_overlay_medium_webp.webp@medium 124.8ms 123.9ms -0.7% 1.000 +0.00 ~
photo_perlin_small_avif.avif@high 58.1ms 58.5ms +0.7% 1.000 +0.00 ~
path_thin_gradient_large_jpeg.jpeg@medium 1345.2ms 1354.5ms +0.7% 1.000 +0.00 ~
animated_translation_xlarge_apng.apng@low 5644.3ms 5605.8ms -0.7% 1.000 +0.00 ~
photo_noise_large_avif.avif@high 1328.3ms 1319.3ms -0.7% 1.000 +0.00 ~
photo_noise_xlarge_avif.avif@high 3532.2ms 3508.6ms -0.7% 1.000 +0.00 ~
vector_geometric_small_svg.svg@medium 22.2ms 22.0ms -0.7% 1.000 +0.00 ~
text_screenshot_large_jpeg.jpeg@high 445.1ms 442.1ms -0.7% 1.000 +0.00 ~
photo_perlin_small_webp.webp@high 54.5ms 54.1ms -0.7% 1.000 +0.00 ~
path_thin_gradient_large_png.png@low 21900.6ms 21759.0ms -0.6% 1.000 +0.00 ~
photo_noise_small_heic.heic@high 68.4ms 67.9ms -0.6% 1.000 +0.00 ~
graphic_palette_small_png.png@low 141.6ms 140.7ms -0.6% 1.000 +0.00 ~
deep_color_10bit_small_avif.avif@high 45.7ms 46.0ms +0.6% 1.000 +0.00 ~
photo_perlin_medium_jpeg.jpeg@high 143.9ms 144.8ms +0.6% 1.000 +0.00 ~
photo_noise_small_webp.webp@high 9.3ms 9.3ms +0.6% 1.000 +0.00 ~
text_screenshot_medium_png.png@low 656.1ms 652.0ms -0.6% 1.000 +0.00 ~
photo_perlin_tiny_png.png@low 17.8ms 17.7ms -0.6% 1.000 +0.00 ~
vector_geometric_large_canvas_svg.svg@high 236.3ms 237.7ms +0.6% 1.000 +0.00 ~
animated_redraw_xlarge_gif.gif@low 729.8ms 725.4ms -0.6% 1.000 +0.00 ~
transparent_overlay_small_png.png@high 28.2ms 28.3ms +0.6% 1.000 +0.00 ~
photo_perlin_large_bmp.bmp@high 260.1ms 261.7ms +0.6% 1.000 +0.00 ~
photo_noise_small_webp.webp@low 10.0ms 10.0ms +0.6% 1.000 +0.00 ~
animated_redraw_large_gif.gif@high 857.6ms 852.9ms -0.5% 1.000 +0.00 ~
text_screenshot_medium_jpeg.jpeg@low 48.1ms 48.4ms +0.5% 1.000 +0.00 ~
path_thin_gradient_medium_png.png@high 11380.7ms 11322.0ms -0.5% 1.000 +0.00 ~
vector_geometric_tiny_svg.svg@high 15.5ms 15.6ms +0.5% 1.000 +0.00 ~
photo_noise_xlarge_webp.webp@low 1489.7ms 1497.1ms +0.5% 1.000 +0.00 ~
vector_with_script_small_svgz.svgz@high 242.4ms 241.2ms -0.5% 1.000 +0.00 ~
graphic_palette_small_png.png@medium 227.7ms 226.6ms -0.5% 1.000 +0.00 ~
path_text_on_flat_large_png.png@medium 6641.1ms 6609.0ms -0.5% 1.000 +0.00 ~
photo_perlin_large_png.png@high 3216.8ms 3201.3ms -0.5% 1.000 +0.00 ~
path_thin_gradient_medium_png.png@low 15304.8ms 15232.3ms -0.5% 1.000 +0.00 ~
graphic_geometric_tiny_png.png@low 4.0ms 4.0ms +0.5% 1.000 +0.00 ~
graphic_geometric_medium_png.png@medium 5759.5ms 5734.3ms -0.4% 1.000 +0.00 ~
photo_perlin_small_png.png@medium 87.7ms 87.3ms -0.4% 1.000 +0.00 ~
fat_tiff_perlin_xlarge.tiff@low 1766.2ms 1758.8ms -0.4% 1.000 +0.00 ~
animated_translation_xlarge_apng.apng@medium 5636.2ms 5613.1ms -0.4% 1.000 +0.00 ~
animated_redraw_large_gif.gif@medium 866.2ms 869.7ms +0.4% 1.000 +0.00 ~
transparent_overlay_medium_webp.webp@low 124.5ms 125.0ms +0.4% 1.000 +0.00 ~
text_screenshot_large_png.png@medium 7942.5ms 7911.3ms -0.4% 1.000 +0.00 ~
path_text_on_flat_large_jpeg.jpeg@medium 604.0ms 601.6ms -0.4% 1.000 +0.00 ~
photo_perlin_large_tiff.tiff@low 179.5ms 178.8ms -0.4% 1.000 +0.00 ~
animated_redraw_large_gif.gif@low 143.5ms 143.0ms -0.4% 1.000 +0.00 ~
photo_noise_large_jpeg.jpeg@medium 372.6ms 371.2ms -0.4% 1.000 +0.00 ~
photo_perlin_medium_jpeg.jpeg@medium 144.2ms 144.7ms +0.4% 1.000 +0.00 ~
fat_avif_noise_xlarge.avif@high 9660.2ms 9626.4ms -0.4% 1.000 +0.00 ~
animated_translation_xlarge_apng.apng@high 5545.5ms 5564.9ms +0.3% 1.000 +0.00 ~
path_text_on_flat_medium_webp.webp@high 76.2ms 75.9ms -0.3% 1.000 +0.00 ~
photo_perlin_small_webp.webp@low 60.6ms 60.4ms -0.3% 1.000 +0.00 ~
vector_geometric_small_svg.svg@low 22.2ms 22.2ms +0.3% 1.000 +0.00 ~
photo_perlin_large_png.png@low 2881.7ms 2891.1ms +0.3% 1.000 +0.00 ~
photo_noise_medium_webp.webp@low 160.3ms 159.8ms -0.3% 1.000 +0.00 ~
photo_perlin_medium_jpeg.jpeg@low 145.3ms 144.9ms -0.3% 1.000 +0.00 ~
path_text_on_flat_medium_png.png@medium 1689.5ms 1684.1ms -0.3% 1.000 +0.00 ~
text_screenshot_medium_png.png@high 655.3ms 653.2ms -0.3% 1.000 +0.00 ~
photo_noise_xlarge_webp.webp@medium 1464.6ms 1469.0ms +0.3% 1.000 +0.00 ~
path_thin_gradient_large_jpeg.jpeg@low 1356.2ms 1352.3ms -0.3% 1.000 +0.00 ~
path_thin_gradient_medium_png.png@medium 16224.6ms 16271.2ms +0.3% 1.000 +0.00 ~
path_thin_gradient_medium_jpeg.jpeg@high 164.3ms 164.8ms +0.3% 1.000 +0.00 ~
photo_noise_large_jpeg.jpeg@low 371.4ms 370.3ms -0.3% 1.000 +0.00 ~
photo_perlin_small_webp.webp@medium 57.3ms 57.2ms -0.3% 1.000 +0.00 ~
photo_perlin_xlarge_heic.heic@low 9449.1ms 9423.5ms -0.3% 1.000 +0.00 ~
animated_translation_large_apng.apng@medium 745.9ms 744.0ms -0.3% 1.000 +0.00 ~
photo_noise_medium_avif.avif@medium 123.0ms 123.3ms +0.2% 1.000 +0.00 ~
photo_noise_large_avif.avif@medium 1687.4ms 1683.3ms -0.2% 1.000 +0.00 ~
vector_with_script_small_svg.svg@medium 22.1ms 22.1ms +0.2% 1.000 +0.00 ~
photo_noise_large_webp.webp@low 905.1ms 907.1ms +0.2% 1.000 +0.00 ~
text_screenshot_medium_jpeg.jpeg@high 48.0ms 47.9ms -0.2% 1.000 +0.00 ~
transparent_overlay_large_png.png@medium 1189.9ms 1187.4ms -0.2% 1.000 +0.00 ~
fat_tiff_perlin_xlarge.tiff@high 1804.6ms 1808.1ms +0.2% 1.000 +0.00 ~
path_text_on_flat_medium_webp.webp@medium 77.8ms 77.9ms +0.2% 1.000 +0.00 ~
photo_perlin_medium_png.png@medium 519.7ms 520.6ms +0.2% 1.000 +0.00 ~
photo_perlin_xlarge_tiff.tiff@medium 447.5ms 448.2ms +0.2% 1.000 +0.00 ~
photo_noise_xlarge_png.png@medium 1748.5ms 1746.0ms -0.1% 1.000 +0.00 ~
path_thin_gradient_large_png.png@medium 24237.0ms 24270.9ms +0.1% 1.000 +0.00 ~
path_text_on_flat_large_webp.webp@medium 4015.0ms 4020.6ms +0.1% 1.000 +0.00 ~
vector_geometric_medium_svg.svg@medium 226.7ms 226.4ms -0.1% 1.000 +0.00 ~
animated_redraw_xlarge_gif.gif@medium 4519.1ms 4513.1ms -0.1% 1.000 +0.00 ~
graphic_geometric_medium_png.png@high 5749.5ms 5756.9ms +0.1% 1.000 +0.00 ~
path_text_on_flat_small_jpeg.jpeg@medium 21.0ms 21.0ms -0.1% 1.000 +0.00 ~
graphic_palette_small_png.png@high 242.5ms 242.2ms -0.1% 1.000 +0.00 ~
photo_perlin_large_tiff.tiff@medium 185.0ms 184.7ms -0.1% 1.000 +0.00 ~
text_screenshot_large_jpeg.jpeg@medium 444.3ms 444.8ms +0.1% 1.000 +0.00 ~
path_text_on_flat_large_png.png@high 6653.5ms 6646.5ms -0.1% 1.000 +0.00 ~
path_thin_gradient_medium_webp.webp@high 3501.4ms 3505.0ms +0.1% 1.000 +0.00 ~
photo_perlin_medium_png.png@low 522.4ms 521.9ms -0.1% 1.000 +0.00 ~
transparent_overlay_large_png.png@high 1660.8ms 1659.2ms -0.1% 1.000 +0.00 ~
graphic_geometric_tiny_avif.avif@medium 5.6ms 5.6ms +0.1% 1.000 +0.00 ~
transparent_overlay_large_png.png@low 906.6ms 907.4ms +0.1% 1.000 +0.00 ~
fat_avif_noise_xlarge.avif@medium 12867.7ms 12855.9ms -0.1% 1.000 +0.00 ~
vector_with_script_medium_svg.svg@medium 226.2ms 226.4ms +0.1% 1.000 +0.00 ~
path_text_on_flat_medium_png.png@high 1647.7ms 1646.4ms -0.1% 1.000 +0.00 ~
photo_noise_xlarge_avif.avif@medium 4457.2ms 4454.0ms -0.1% 1.000 +0.00 ~
photo_perlin_small_tiff.tiff@medium 4.1ms 4.1ms +0.1% 1.000 +0.00 ~
path_text_on_flat_medium_webp.webp@low 79.1ms 79.0ms -0.1% 1.000 +0.00 ~
animated_translation_large_apng.apng@high 745.2ms 744.7ms -0.1% 1.000 +0.00 ~
fat_png_noise_xlarge.png@high 3572.0ms 3573.7ms +0.0% 1.000 +0.00 ~
photo_noise_xlarge_jpeg.jpeg@low 1381.3ms 1381.9ms +0.0% 1.000 +0.00 ~
path_text_on_flat_large_jpeg.jpeg@low 605.6ms 605.3ms -0.0% 1.000 +0.00 ~
photo_perlin_small_tiff.tiff@high 4.2ms 4.2ms -0.0% 1.000 +0.00 ~
animated_redraw_xlarge_gif.gif@high 4470.7ms 4471.8ms +0.0% 1.000 +0.00 ~
fat_tiff_perlin_xlarge.tiff@medium 1810.7ms 1811.0ms +0.0% 1.000 +0.00 ~
deep_color_10bit_small_heic.heic@high 31.4ms 31.4ms +0.0% 1.000 +0.00 ~
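
For readers checking the table, the Δ% column appears to be the relative change of head against baseline. The helper below is a hypothetical sketch of that arithmetic. The baseline and head columns are rounded to 0.1 ms, so it reproduces the larger rows but not the sub-millisecond ones.

```python
def delta_pct(baseline_ms: float, head_ms: float) -> float:
    """Relative change of head vs baseline, in percent."""
    return (head_ms - baseline_ms) / baseline_ms * 100.0


# Rows taken from the table above (values in ms).
flagged = round(delta_pct(3044.4, 4048.7), 1)   # path_text_on_flat_large_webp.webp@high
faster = round(delta_pct(4028.7, 3062.1), 1)    # path_text_on_flat_large_webp.webp@low
```

These evaluate to 33.0 and -24.0, matching the +33.0% and -24.0% entries for the two path_text_on_flat_large_webp rows.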

Compression

reduction_threshold=3.0pp, size_threshold=5.0%

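| Format | Cases | Method changes | Median Δreduction (pp) | Worst Δpp | Size regressions | Status |
| --- | --- | --- | --- | --- | --- | --- |
| apng | 15 | 0 | +0.0 | +0.0 | 0 | ~ |
| avif | 30 | 0 | +0.0 | +0.0 | 0 | ~ |
| bmp | 18 | 0 | +0.0 | +0.0 | 0 | ~ |
| gif | 15 | 0 | +0.0 | +0.0 | 0 | ~ |
| heic | 21 | 0 | +0.0 | +0.0 | 0 | ~ |
| jpeg | 48 | 0 | +0.0 | +0.0 | 0 | ~ |
| png | 60 | 0 | +0.0 | +0.0 | 0 | ~ |
| svg | 18 | 0 | +0.0 | +0.0 | 0 | ~ |
| svgz | 6 | 0 | +0.0 | +0.0 | 0 | ~ |
| tiff | 18 | 0 | +0.0 | +0.0 | 0 | ~ |
| webp | 36 | 0 | +0.0 | +0.0 | 0 | ~ |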

Estimation

estimation_threshold=10.0pp

| Format×Path | Cases | Median Δerror (pp) | Worst Δpp | Path shifts | Status |
| --- | --- | --- | --- | --- | --- |
| apng × exact | 15 | +0.0 | +0.0 | 0 | ~ |
| avif × direct_encode_sample | 9 | +0.0 | +0.0 | 0 | ~ |
| avif × exact | 21 | +0.0 | +0.0 | 0 | ~ |
| bmp × exact | 9 | +0.0 | +0.0 | 0 | ~ |
| bmp × generic_fallback_sample | 9 | +0.0 | +0.0 | 0 | ~ |
| gif × exact | 15 | +0.0 | +0.0 | 0 | ~ |
| heic × direct_encode_sample | 6 | +0.0 | +0.0 | 0 | ~ |
| heic × exact | 15 | +0.0 | +0.0 | 0 | ~ |
| jpeg × direct_encode_sample | 18 | +0.0 | +0.0 | 0 | ~ |
| jpeg × exact | 30 | +0.0 | +0.0 | 0 | ~ |
| png × direct_encode_sample | 21 | +0.0 | +0.0 | 0 | ~ |
| png × exact | 39 | +0.0 | +0.0 | 0 | ~ |
| svg × exact | 18 | +0.0 | +0.0 | 0 | ~ |
| svgz × exact | 6 | +0.0 | +0.0 | 0 | ~ |
| tiff × direct_encode_sample | 9 | +0.0 | +0.0 | 0 | ~ |
| tiff × exact | 9 | +0.0 | +0.0 | 0 | ~ |
| webp × direct_encode_sample | 12 | +0.0 | +0.0 | 0 | ~ |
| webp × exact | 24 | +0.0 | +0.0 | 0 | ~ |

Auto-posted by .github/workflows/bench-pr.yml. Re-runs on every push; this comment updates in place.

Two compare-engine refinements driven by the first bench-pr.yml run on
this PR (#34): the timing gate fired on noise, not on real regressions,
so this commit tunes the gate.

## Fix A — _STATS_MIN_ITERS 3 → 2

Welch's t-test at n=2 per side has only 1 to 2 degrees of freedom
(Welch-Satterthwaite), so the CI is wide and the test is statistically
weak. But it still returns p>0.05 for noisy cases (no false positive)
and clears tight ones. That is better than the flat 25% threshold in
both directions, and it matches accuracy mode's natural rhythm (the PR
ships --repeat 2 for that mode given the timeout budget).
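
To make the n=2 behaviour concrete, here is a minimal stdlib-only sketch of Welch's t-test. It is not the compare engine's code: the function names `welch_t` and `two_sided_p` and the sample timings are hypothetical, and the p-value is obtained by numerical integration rather than by whatever routine the repo actually uses.

```python
import math
from statistics import mean, variance


def welch_t(baseline, candidate):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(baseline), len(candidate)
    # Variance of each side's mean; needs at least 2 iterations per side.
    va, vb = variance(baseline) / na, variance(candidate) / nb
    se2 = va + vb
    if se2 == 0.0:
        raise ValueError("zero variance on both sides; t is undefined")
    t = (mean(candidate) - mean(baseline)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value of Student's t via Simpson integration.

    Substituting t = sqrt(df) * tan(theta) turns the t density into
    c * cos(theta) ** (df - 1) on a bounded interval. `steps` must be even.
    """
    theta_max = math.atan(abs(t) / math.sqrt(df))
    c = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
    h = theta_max / steps
    total = 1.0 + math.cos(theta_max) ** (df - 1)  # endpoints; f(0) = 1
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (df - 1)
    central_mass = 2 * c * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


# Hypothetical timings in ms, two iterations per side.
noisy = welch_t([100.0, 140.0], [130.0, 150.0])   # large spread, modest shift
tight = welch_t([100.0, 100.2], [110.0, 110.2])   # tiny spread, clear shift
print("noisy: df=%.2f p=%.3f" % (noisy[1], two_sided_p(*noisy)))
print("tight: df=%.2f p=%.4f" % (tight[1], two_sided_p(*tight)))
```

With these made-up samples the noisy pair stays above p=0.05 while the tight pair falls below it, which is the both-directions behaviour described above.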

## Fix C — noise-floor gate skips cases below absolute-ms floor

Cases with baseline median below 5 ms (configurable via the new
--noise-floor-min-ms flag) are not flagged by the relative gate. A
0.04 ms → 0.07 ms case is measurement quantization, not signal —
the AVIF "already-optimized" early-skip path produced these at high
volume in the first bench-pr run on #34 (9 of 13 flagged cases were
sub-0.1 ms).

The floor is surfaced in the markdown header alongside threshold-pct and
noise-floor-pct so PR comment readers see the gate definition.
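
The gate described in this section can be sketched roughly as follows. This is an illustration under assumptions, not the repo's implementation: the function name `noise_floor_flag` and its parameter names are hypothetical, the 25% and 5 ms defaults are taken from the text above, and flagging only slowdowns (not speedups) is an assumption.

```python
def noise_floor_flag(baseline_ms, candidate_ms,
                     noise_floor_pct=25.0, noise_floor_min_ms=5.0):
    """Return True when the relative noise-floor gate would flag a case."""
    # Below the absolute floor, timer quantization dominates: never flag.
    if baseline_ms < noise_floor_min_ms:
        return False
    delta_pct = (candidate_ms - baseline_ms) / baseline_ms * 100.0
    # Assumption: only slowdowns beyond the relative threshold are flagged.
    return delta_pct > noise_floor_pct


print(noise_floor_flag(0.5, 0.7))      # sub-floor baseline, +40%
print(noise_floor_flag(100.0, 140.0))  # above-floor baseline, +40%
```

The two calls mirror the 0.5→0.7 ms and 100→140 ms cases named in the new tests: the first is skipped by the floor, the second is still eligible for the relative gate.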

3 new tests:
- test_noise_floor_skipped_below_min_ms: 0.5→0.7ms case (+40%) doesn't flag
- test_noise_floor_fires_at_or_above_min_ms: 100→140ms case (+40%) still flags
- test_stats_engages_at_n_equals_2: 2 iters each side routes through stats path

## Empirical impact on PR #34's own bench run

Re-running compare against the artifact from run 26044094317:
- Before: 13 noise_floor_flags (9 sub-0.1ms AVIF + 4 BMP CPU-steal)
- After:   3 noise_floor_flags (sub-ms quantization gone; CPU-steal remains)

The 3 surviving flags are real shared-CI noise on single-threaded
multi-100ms BMP cases — a separate axis (per-format noise floor or
CPU-steal isolation) that's out of scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions

✅ Test & Coverage Report

Status: PASS | Tests: 1306 passed, 0 failed, 0 errors (1306 total)

🟢 Overall Coverage: 95.0%

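| Module | Statements | Covered | Missing | Coverage |
|---|---|---|---|---|
| config.py | 50 | 50 | 0 | 100.0% |
| estimation | 1252 | 1144 | 108 | 91.4% |
| exceptions.py | 40 | 40 | 0 | 100.0% |
| optimizers | 866 | 830 | 36 | 95.8% |
| routers | 226 | 220 | 6 | 97.3% |
| schemas.py | 68 | 68 | 0 | 100.0% |
| security | 166 | 165 | 1 | 99.4% |
| storage | 31 | 31 | 0 | 100.0% |
| utils | 409 | 404 | 5 | 98.8% |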

Report generated at 2026-05-18 16:23:01 UTC from commit 334ff45

@amitray007 amitray007 merged commit 6246632 into main May 18, 2026
2 of 3 checks passed
@amitray007 amitray007 deleted the bench-hardening-v2 branch May 18, 2026 17:06
amitray007 added a commit that referenced this pull request May 18, 2026
…bench-baseline]

The previous baseline.core.json was generated at --repeat 1 (pre-PR #34's
accuracy-mode --repeat support). After #34 merged, the auto bench-baseline-
update workflow ran the candidate at --repeat 2 but couldn't promote because
the n=1 vs n=2 mismatch routes the compare to the noise-floor path, where
single-threaded BMP cases on shared CI tripped 2 false-positive flags
(opened as drift issue #35).

Adopt the candidate from CI run 26048342324 directly as the pinned baseline
so future comparisons are n=2 vs n=2 and engage Welch's t-test properly,
escaping the bootstrap cycle. [skip bench-baseline] guards against the
workflow firing on this commit and re-opening another drift issue.
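
The routing that causes the n=1 vs n=2 mismatch can be pictured with a small sketch. The function name `compare_path` and the returned labels are hypothetical; the only facts taken from the PR are the minimum of 2 iterations and that both sides must meet it for the stats path to engage.

```python
STATS_MIN_ITERS = 2  # mirrors the _STATS_MIN_ITERS value named in PR #34


def compare_path(baseline_iters, candidate_iters, min_iters=STATS_MIN_ITERS):
    """Pick the comparison path from the per-side iteration counts."""
    if baseline_iters >= min_iters and candidate_iters >= min_iters:
        return "stats"        # Welch's t-test
    return "noise_floor"      # relative-threshold fallback


print(compare_path(1, 2))  # old --repeat 1 baseline vs --repeat 2 candidate
print(compare_path(2, 2))  # after re-pinning the baseline at --repeat 2
```

Under this reading, re-pinning the baseline at --repeat 2 is what moves every future comparison from the fallback onto the stats path.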

Closes #35.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

bench: baseline drift detected on main@9e72ef433178240265e1af52a0826b62eb7ff997
