feat(bench): accuracy --repeat support + auto-on-PR bench workflow by amitray007 · Pull Request #34 · amitray007/pare

amitray007 · 2026-05-18T15:44:31Z

Summary

Three complementary bench-CI improvements, bundled:

accuracy mode honors --repeat / --warmup — closes the residual gap from #31 (bench: baseline drift detected on main@56f9595ebcae66c4be93b3dcaab4a882b5998772) and #33 (bench: baseline drift detected on main@9e72ef433178240265e1af52a0826b62eb7ff997), where timing comparisons in accuracy mode fell through to the 25% noise-floor path and produced false-positive drift issues on shared CI.
New bench-pr.yml workflow — auto-posts a sticky four-section bench comment (Timing / Compression / Estimation / Errors) on every PR that touches optimizer/estimator/bench paths. Fails CI on regression in any axis.
Compare-engine noise tuning (added after the first bench-pr run on this PR exposed a calibration gap) — drops the stats-path minimum from 3 to 2 iterations (matching accuracy mode's natural rhythm) and adds an absolute-ms floor on the noise-floor gate so sub-millisecond measurement quantization stops flagging.

Together: future optimizer PRs land with an automatic impact report; baseline-update no longer fires drift issues on pure scheduler noise; the noise-floor fallback no longer over-fires on AVIF early-skip cases.

What changed

accuracy mode `--repeat` (closes the timing-noise residual)

bench/runner/modes/accuracy.py:

_run_one_accuracy_case(case, iteration=0) — accepts the iteration index instead of hardcoding 0.
run_accuracy(cases, repeat=1, warmup=0) — new params; warmup-then-repeat loop matching quick/timing mode. Warmup results discarded.
run_accuracy_sync plumbs both params.
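
The warmup-then-repeat shape described above can be sketched roughly as follows. This is an illustration only, not the code in bench/runner/modes/accuracy.py: `run_accuracy_sketch`, the `run_one` callable (a stand-in for `_run_one_accuracy_case`), the dict-shaped results, and the loop nesting (iteration outside, case inside) are all assumptions.

```python
from typing import Callable, Iterable


def run_accuracy_sketch(
    cases: Iterable[str],
    run_one: Callable[[str, int], dict],  # hypothetical stand-in for _run_one_accuracy_case
    repeat: int = 1,
    warmup: int = 0,
) -> list[dict]:
    """Run every case `warmup` times (discarded), then `repeat` times (recorded)."""
    case_list = list(cases)
    results: list[dict] = []

    # Warmup passes: executed for their side effects (caches, JIT, page-ins),
    # but their results are intentionally dropped.
    for _ in range(warmup):
        for case in case_list:
            run_one(case, 0)

    # Measured passes: each result carries its own iteration index instead of
    # a hardcoded 0, so entries from different passes no longer collide.
    for iteration in range(repeat):
        for case in case_list:
            results.append(run_one(case, iteration))

    return results
```

With `repeat=3, warmup=1` and two cases, the callable runs eight times but only six results are kept, with iteration indices 0, 1 and 2.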

bench/runner/cli.py: passes args.repeat / args.warmup into the accuracy sync wrapper AND into the recorded config dict so the JSON faithfully reflects what ran.

tests/bench/test_accuracy.py: 2 new tests covering the repeat loop and warmup-iter discard.

bench-baseline-update.yml — sensible defaults

--repeat 1 (effectively, before this PR) → --repeat 2 --warmup 0. Welch's t-test minimum is now n=2 (see Fix A below); at roughly 22 min per iteration, running 3 iterations would take ~66 min, over budget. Timeout bumped 40 → 50 min for headroom.
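
The budget arithmetic behind this choice can be checked with a small sketch. It assumes the figure from this PR's commit message (about 22 min of CI time per accuracy pass) and further assumes that warmup and measured passes cost the same; `estimated_minutes` and `fits` are hypothetical helper names, not part of the bench tooling.

```python
PER_ITERATION_MIN = 22  # approximate CI minutes per accuracy pass (from the commit message)


def estimated_minutes(repeat: int, warmup: int = 0) -> int:
    """Rough wall time, assuming every pass (warmup or measured) costs the same."""
    return PER_ITERATION_MIN * (repeat + warmup)


def fits(repeat: int, warmup: int, timeout_min: int) -> bool:
    """True when the estimated run time stays within the workflow timeout."""
    return estimated_minutes(repeat, warmup) <= timeout_min


for repeat, warmup in [(2, 0), (3, 0), (3, 1)]:
    minutes = estimated_minutes(repeat, warmup)
    print(f"repeat={repeat} warmup={warmup}: ~{minutes} min, fits 50 min: {fits(repeat, warmup, 50)}")
```

Under these assumptions only `--repeat 2 --warmup 0` (about 44 min) fits the 50-minute timeout, and it would not have fit the old 40-minute one.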

New `bench-pr.yml` — auto bench comment on PRs

trigger	path filter	mode	repeat	timeout
`pull_request` (open/synchronize/reopen)	same as bench-baseline-update	accuracy	2	50 min

Builds pare image from PR head with GHA cache (shared across PRs and baseline-update).
Runs bench.run + bench.compare reports/baseline.core.json reports/_head.json.
Posts a sticky comment with the four-section diff via peter-evans/find-comment + create-or-update-comment (matching bench.yml's pattern).
A hidden signature in the comment body keeps repeat pushes updating the same comment.
Concurrency cancel-in-progress: true cancels older runs when new commits push.
Fails CI on regression in any axis — turns the four-section gate into an actual merge requirement.
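
The sticky-comment behaviour (find the earlier bot comment by its hidden signature, update it if present, otherwise create it) can be modelled in a few lines. This is an in-memory sketch, not the GitHub API and not the workflow's actual steps: `FakePR`, `upsert_sticky`, and the `MARKER` text are hypothetical, and the real signature string used by bench-pr.yml is not shown in this PR text.

```python
from dataclasses import dataclass, field

# Hypothetical marker; the workflow's real hidden signature is not reproduced here.
MARKER = "<!-- bench-pr-sticky -->"


@dataclass
class FakePR:
    """In-memory stand-in for a pull request's comment thread."""

    comments: list[str] = field(default_factory=list)

    def upsert_sticky(self, report_md: str) -> int:
        """Create the bench comment, or replace it in place if it already exists."""
        body = f"{MARKER}\n{report_md}"
        for index, existing in enumerate(self.comments):
            if MARKER in existing:
                # A previous run already posted: overwrite that comment.
                self.comments[index] = body
                return index
        # First run on this PR: append a new comment.
        self.comments.append(body)
        return len(self.comments) - 1
```

Calling `upsert_sticky` twice on the same `FakePR` leaves a single bench comment holding the latest report, which is the property the hidden signature is there to provide.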

Compare-engine noise tuning (commit `334ff45`)

Fix A: _STATS_MIN_ITERS from 3 → 2 in compare.py. Welch's t-test at n=2 has 1 degree of freedom — wide CI but functional. p > 0.05 on noisy cases (correct: no false positive), p < 0.05 on tight ones. Matches accuracy mode's --repeat 2.
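
To make the n=2 case concrete, here is a minimal stdlib-only sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom. It is not the code in compare.py, and `welch_t_and_df` is a hypothetical name. With two samples per side the approximation yields between 1 and 2 degrees of freedom, so the figure of 1 quoted above is the conservative lower bound.

```python
import math
from statistics import mean, variance


def welch_t_and_df(baseline: list[float], head: list[float]) -> tuple[float, float]:
    """Return (t, df) for Welch's unequal-variance t-test; needs >= 2 samples per side."""
    n_a, n_b = len(baseline), len(head)
    # Squared standard error contributed by each side.
    se2_a = variance(baseline) / n_a
    se2_b = variance(head) / n_b
    se2 = se2_a + se2_b
    if se2 == 0:
        raise ValueError("both samples have zero variance; t is undefined")
    t = (mean(head) - mean(baseline)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the effective degrees of freedom.
    df = se2**2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t, df


# Equal spread on both sides: df reaches its maximum of 2 for n=2 per side.
print(welch_t_and_df([100.0, 102.0], [101.0, 103.0]))
# One side has no spread at all: df collapses to 1.
print(welch_t_and_df([100.0, 100.0], [101.0, 103.0]))
```

Turning `t` and `df` into a p-value requires the Student t distribution, which the standard library does not provide, so that step is left out of the sketch.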

Fix C: new --noise-floor-min-ms flag (default 5.0). When a case's baseline median is below this threshold, the relative noise-floor gate is skipped — a 0.04ms → 0.07ms = +75% delta is measurement quantization, not signal. The AVIF early-skip path produced 9 such false positives in this PR's first bench-pr run. Surfaced in the markdown header so reviewers see the gate definition.
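
The gate described in Fix C can be sketched as a single predicate. This is a hedged illustration, not the compare.py implementation: `noise_floor_flag` is a hypothetical name, and flagging only slowdowns with a strict `>` comparison is an assumption. The defaults mirror the values in the bench comment header (noise-floor=25.0%, noise-floor-min=5.0ms).

```python
def noise_floor_flag(
    baseline_ms: float,
    head_ms: float,
    noise_floor_pct: float = 25.0,
    noise_floor_min_ms: float = 5.0,
) -> bool:
    """True when a slowdown exceeds the relative gate on a case large enough to trust."""
    # Below the absolute floor, timer quantization dominates, so the relative
    # gate is skipped entirely. This also avoids dividing by a zero baseline.
    if baseline_ms <= 0 or baseline_ms < noise_floor_min_ms:
        return False
    delta_pct = (head_ms - baseline_ms) / baseline_ms * 100.0
    return delta_pct > noise_floor_pct


# Sub-millisecond AVIF early-skip case: +75% relative, but only 0.03 ms absolute.
print(noise_floor_flag(0.04, 0.07))    # skipped by the absolute floor
# A 100 ms case with a +40% slowdown still flags.
print(noise_floor_flag(100.0, 140.0))
```

The two calls correspond to the behaviours the new tests are described as covering: sub-floor cases never flag, while cases at or above the floor are still subject to the relative gate.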

3 new tests covering both fixes.

Empirical impact on this PR's own bench run

The first bench-pr.yml run on commit e55860b flagged 13 noise_floor_flags. After applying A + C and re-comparing the same artifact: 13 → 3 noise_floor_flags. The 9 sub-millisecond AVIF false-positives are gone. The remaining 3 are real BMP single-threaded multi-100ms cases hit by shared-CI CPU steal — a separate axis worth tackling later but not blocking.

Docs

bench/CLAUDE.md "CI integration" section updated to describe all three workflows; removed the stale "accuracy mode currently ignores --repeat" note.

Test plan

pytest tests/bench/ — 520 passed, 2 skipped (was 500 at the merge of PR #32, "feat(bench): three-axis CI gating — timing, compression, estimation"; +20 across this PR's additions)
pytest tests/bench/test_compare.py -v — 36 passing
ruff check + black --check clean
First bench-pr.yml run on this PR posted the four-section comment, confirming the auto-on-PR workflow works end-to-end
Re-compare with the new gate: 13 → 3 noise_floor_flags

Known follow-up

3 BMP single-threaded cases at 200ms-1s wall time may still flag on shared CI from real CPU-steal noise. Possible follow-ups (not in this PR): per-format noise-floor thresholds, or isolation runs for the BMP class.

Closes #33.

🤖 Generated with Claude Code

Closes the residual #31/#33 gap (single-sample timing in accuracy mode) and adds an auto-on-PR bench comment workflow so future optimizer PRs get a four-section impact report without manual workflow_dispatch.

## accuracy mode honors --repeat / --warmup

`bench/runner/modes/accuracy.py`:
- `_run_one_accuracy_case` now accepts `iteration: int`; previously hardcoded `iteration: 0` so multi-iter runs would have all entries collide on iteration=0.
- `run_accuracy` gains `repeat: int = 1, warmup: int = 0` params and a warmup-then-repeat loop matching quick/timing mode shape. Warmup iterations are discarded; measured iterations populate the result list with the correct iteration index.
- `run_accuracy_sync` plumbs `repeat`/`warmup` through.

`bench/runner/cli.py`: pass `args.repeat`/`args.warmup` into the accuracy sync wrapper and into the recorded `config` dict so the JSON faithfully records what ran.

`tests/bench/test_accuracy.py`: 2 new tests covering the repeat loop and warmup-iter discard.

This unblocks Welch's t-test for timing on accuracy mode. With n>=3 both sides, the stats gate fires instead of the 25% noise-floor fallback that opened #31 and #33.

## bench-baseline-update.yml — --repeat 2 --warmup 0, timeout 50 min

Accuracy mode at --repeat 1 took ~22 min CI. At --repeat 3 + warmup 1 that would be ~88 min, over the 40-min timeout. Picked --repeat 2 + warmup 0 (≈44 min) as the practical sweet spot: Welch's t-test at n=2 is statistically weak but better than noise-floor; timeout bumped 40 to 50 min for headroom.

## New bench-pr.yml workflow

`.github/workflows/bench-pr.yml`:
- Auto-runs on pull_request open/synchronize/reopen with the same path filter as bench-baseline-update.yml.
- Runs `bench.run --mode accuracy --repeat 2 --warmup 0`, then `bench.compare reports/baseline.core.json reports/_head.json`.
- Posts a sticky comment via a hidden signature and peter-evans/find-comment + create-or-update-comment.
- Concurrency `cancel-in-progress: true` cancels older runs when new commits push, so the comment always reflects the latest commit.
- Fails CI on regression in any axis (timing/compression/estimation/errors), turning the bench delta into an actual merge gate.

Docs: bench/CLAUDE.md "CI integration" section refreshed to describe all three workflows; removed the stale "accuracy mode currently ignores --repeat" note now that it's no longer true.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR closes the timing-noise residual in accuracy mode by making it honor --repeat/--warmup, retunes the baseline-update workflow to use --repeat 2 --warmup 0 within a 50-minute budget, and introduces a new bench-pr.yml workflow that posts a sticky four-section bench comment on every PR touching optimizer/estimator/bench paths and fails CI on regression.

Changes:

run_accuracy now executes warmup + repeat iterations per case, with iteration indices recorded; CLI wires --warmup/--repeat into both the run and the recorded config.
bench-baseline-update.yml switches to --repeat 2 --warmup 0 (timeout bumped to 50 min) so Welch's t-test gets n=2 instead of falling into the 25% noise-floor path.
New bench-pr.yml builds the PR image, runs accuracy mode, compares against the pinned baseline, posts/updates a sticky PR comment, and fails on regression.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


File	Description
bench/runner/modes/accuracy.py	Add `iteration` arg and warmup/repeat loop in `run_accuracy`/`run_accuracy_sync`.
bench/runner/cli.py	Pass `args.repeat`/`args.warmup` into accuracy mode and into the recorded config.
tests/bench/test_accuracy.py	Add tests covering `repeat=3` indices and discarded warmup iterations.
.github/workflows/bench-baseline-update.yml	Drop forward-compat flags to `--repeat 2 --warmup 0`, raise timeout to 50 min, refresh comment.
.github/workflows/bench-pr.yml	New workflow: auto bench compare + sticky PR comment + regression gate.
bench/CLAUDE.md	Document third workflow and updated baseline recipe.


+        config = {
+            "warmup": args.warmup,
+            "repeat": args.repeat,
+            "stages": ["estimate", "optimize"],
+        }
+        iterations = run_accuracy_sync(cases, repeat=args.repeat, warmup=args.warmup)


github-actions · 2026-05-18T16:10:33Z

Pare bench — PR #34

Run: 26046009297 · Commit: 334ff451c447c14ad47952e5daceb769baedc5db · Mode: accuracy / core · Threshold: ±10%

Bench compare: baseline.core.json → _head.json

threshold=10.0%, noise-floor=25.0%, noise-floor-min=5.0ms, α=0.05, cases compared=285, regressions=0, noise_floor_flags=1, improvements=0

Compare conditions

	baseline	head
file	`baseline.core.json`	`_head.json`
mode	`accuracy`	`accuracy`
isolate	`False`	`False`
platform	`linux`	`linux`
cpu_count	`4`	`4`

Timing

Per-format summary

Format	Cases	Median Δ%	Worst Δ%	Regressions	Status
`webp`	36	-0.2%	+33.0%	1	⚠
`avif`	30	+0.8%	+52.5%	0	~
`bmp`	18	-4.1%	+37.4%	0	~
`jpeg`	48	-0.5%	+12.4%	0	~
`heic`	21	-1.0%	+11.5%	0	~
`apng`	15	-2.3%	+9.6%	0	~
`tiff`	18	-0.3%	+4.7%	0	~
`svg`	18	-0.8%	+3.8%	0	~
`gif`	15	-0.8%	+2.8%	0	~
`png`	60	-0.5%	+2.4%	0	~
`svgz`	6	-0.8%	+2.3%	0	~

Per-case detail (285 cases)

case_id	baseline	head	Δ%	p	d	label
`photo_noise_small_avif.avif@low`	0.0ms	0.1ms	+52.5%	1.000	+0.00	~
`photo_perlin_tiny_avif.avif@high`	4.5ms	6.3ms	+41.4%	1.000	+0.00	~
`photo_perlin_tiny_bmp.bmp@low`	0.5ms	0.7ms	+37.4%	1.000	+0.00	~
`path_text_on_flat_large_webp.webp@high`	3044.4ms	4048.7ms	+33.0%	1.000	+0.00	⚠ noise-floor
`graphic_geometric_tiny_avif.avif@low`	0.0ms	0.1ms	+27.2%	1.000	+0.00	~
`photo_noise_large_avif.avif@low`	0.1ms	0.1ms	+26.7%	1.000	+0.00	~
`path_text_on_flat_large_webp.webp@low`	4028.7ms	3062.1ms	-24.0%	1.000	+0.00	~
`photo_perlin_small_avif.avif@medium`	0.0ms	0.1ms	+23.7%	1.000	+0.00	~
`photo_perlin_tiny_avif.avif@low`	0.1ms	0.0ms	-18.9%	1.000	+0.00	~
`animated_translation_medium_apng.apng@high`	100.0ms	81.9ms	-18.1%	1.000	+0.00	~
`photo_perlin_tiny_png.png@high`	14.5ms	12.0ms	-17.5%	1.000	+0.00	~
`path_text_on_flat_medium_jpeg.jpeg@low`	60.3ms	50.0ms	-17.0%	1.000	+0.00	~
`photo_perlin_xlarge_bmp.bmp@low`	12.7ms	10.6ms	-16.2%	1.000	+0.00	~
`photo_perlin_xlarge_bmp.bmp@high`	1012.6ms	860.8ms	-15.0%	1.000	+0.00	~
`graphic_geometric_medium_avif.avif@medium`	0.0ms	0.0ms	+12.6%	1.000	+0.00	~
`photo_perlin_tiny_webp.webp@high`	1.3ms	1.1ms	-12.6%	1.000	+0.00	~
`photo_perlin_small_avif.avif@low`	0.0ms	0.0ms	+12.4%	1.000	+0.00	~
`graphic_geometric_tiny_jpeg.jpeg@low`	2.1ms	2.4ms	+12.4%	1.000	+0.00	~
`deep_color_10bit_small_heic.heic@medium`	47.1ms	52.6ms	+11.5%	1.000	+0.00	~
`photo_perlin_tiny_webp.webp@low`	1.2ms	1.3ms	+11.2%	1.000	+0.00	~
`photo_noise_medium_avif.avif@low`	0.0ms	0.0ms	+10.5%	1.000	+0.00	~
`photo_perlin_large_bmp.bmp@low`	3.7ms	3.3ms	-10.4%	1.000	+0.00	~
`photo_perlin_xlarge_bmp.bmp@medium`	792.3ms	711.3ms	-10.2%	1.000	+0.00	~
`animated_translation_tiny_apng.apng@high`	9.3ms	10.2ms	+9.6%	1.000	+0.00	~
`photo_noise_small_heic.heic@medium`	63.3ms	69.2ms	+9.4%	1.000	+0.00	~
`graphic_geometric_tiny_avif.avif@high`	6.1ms	6.7ms	+9.3%	1.000	+0.00	~
`graphic_palette_tiny_tiff.tiff@low`	3.0ms	2.7ms	-9.1%	1.000	+0.00	~
`animated_translation_small_apng.apng@medium`	30.4ms	27.7ms	-9.0%	1.000	+0.00	~
`photo_perlin_small_bmp.bmp@low`	0.6ms	0.6ms	-8.8%	1.000	+0.00	~
`graphic_geometric_medium_avif.avif@low`	0.1ms	0.0ms	-8.7%	1.000	+0.00	~
`photo_perlin_small_tiff.tiff@low`	4.0ms	3.7ms	-8.7%	1.000	+0.00	~
`photo_perlin_small_heic.heic@low`	47.4ms	43.5ms	-8.3%	1.000	+0.00	~
`graphic_geometric_tiny_png.png@high`	4.4ms	4.1ms	-8.2%	1.000	+0.00	~
`fat_avif_noise_xlarge.avif@low`	0.1ms	0.1ms	+7.7%	1.000	+0.00	~
`photo_perlin_tiny_jpeg.jpeg@high`	2.5ms	2.3ms	-7.6%	1.000	+0.00	~
`photo_noise_large_jpeg.jpeg@high`	370.7ms	396.3ms	+6.9%	1.000	+0.00	~
`animated_translation_small_apng.apng@high`	28.5ms	26.5ms	-6.8%	1.000	+0.00	~
`photo_noise_medium_avif.avif@high`	108.7ms	116.1ms	+6.8%	1.000	+0.00	~
`graphic_geometric_tiny_jpeg.jpeg@high`	2.5ms	2.3ms	-6.7%	1.000	+0.00	~
`fat_bmp_noise_xlarge.bmp@low`	58.5ms	54.7ms	-6.5%	1.000	+0.00	~
`graphic_palette_tiny_tiff.tiff@high`	3.3ms	3.1ms	-6.4%	1.000	+0.00	~
`photo_perlin_tiny_heic.heic@low`	6.7ms	7.1ms	+6.4%	1.000	+0.00	~
`photo_perlin_tiny_heic.heic@high`	10.2ms	9.6ms	-6.2%	1.000	+0.00	~
`photo_perlin_large_bmp.bmp@medium`	224.7ms	211.5ms	-5.9%	1.000	+0.00	~
`photo_noise_xlarge_avif.avif@low`	0.1ms	0.1ms	-5.7%	1.000	+0.00	~
`photo_noise_small_avif.avif@high`	44.7ms	42.3ms	-5.5%	1.000	+0.00	~
`animated_translation_medium_apng.apng@medium`	101.2ms	95.9ms	-5.2%	1.000	+0.00	~
`photo_perlin_small_heic.heic@medium`	79.9ms	83.9ms	+5.0%	1.000	+0.00	~
`photo_perlin_medium_bmp.bmp@low`	1.1ms	1.1ms	-4.8%	1.000	+0.00	~
`photo_noise_medium_webp.webp@high`	157.5ms	150.0ms	-4.8%	1.000	+0.00	~
`animated_redraw_tiny_gif.gif@high`	4.7ms	4.4ms	-4.7%	1.000	+0.00	~
`photo_perlin_xlarge_tiff.tiff@low`	435.1ms	455.7ms	+4.7%	1.000	+0.00	~
`deep_color_10bit_small_avif.avif@low`	0.0ms	0.0ms	+4.7%	1.000	+0.00	~
`photo_perlin_small_heic.heic@high`	55.9ms	53.3ms	-4.7%	1.000	+0.00	~
`photo_noise_small_heic.heic@low`	28.1ms	29.4ms	+4.7%	1.000	+0.00	~
`photo_perlin_small_bmp.bmp@medium`	8.3ms	7.9ms	-4.5%	1.000	+0.00	~
`photo_perlin_medium_heic.heic@high`	261.3ms	250.6ms	-4.1%	1.000	+0.00	~
`animated_translation_medium_apng.apng@low`	47.0ms	45.1ms	-4.1%	1.000	+0.00	~
`photo_perlin_medium_tiff.tiff@high`	29.2ms	28.0ms	-3.9%	1.000	+0.00	~
`vector_geometric_tiny_svg.svg@low`	14.9ms	15.5ms	+3.8%	1.000	+0.00	~
`fat_bmp_noise_xlarge.bmp@high`	7072.5ms	6807.7ms	-3.7%	1.000	+0.00	~
`path_thin_gradient_medium_webp.webp@medium`	3685.6ms	3548.5ms	-3.7%	1.000	+0.00	~
`photo_perlin_tiny_jpeg.jpeg@medium`	2.4ms	2.3ms	-3.7%	1.000	+0.00	~
`graphic_geometric_tiny_png.png@medium`	4.2ms	4.0ms	-3.7%	1.000	+0.00	~
`transparent_overlay_large_webp.webp@medium`	924.6ms	958.1ms	+3.6%	1.000	+0.00	~
`transparent_overlay_medium_png.png@low`	179.7ms	173.3ms	-3.6%	1.000	+0.00	~
`animated_translation_small_apng.apng@low`	14.1ms	13.6ms	-3.6%	1.000	+0.00	~
`animated_redraw_tiny_gif.gif@low`	2.1ms	2.0ms	-3.5%	1.000	+0.00	~
`photo_perlin_medium_tiff.tiff@medium`	28.7ms	27.7ms	-3.5%	1.000	+0.00	~
`path_thin_gradient_medium_webp.webp@low`	3453.8ms	3570.5ms	+3.4%	1.000	+0.00	~
`photo_perlin_tiny_heic.heic@medium`	13.3ms	12.9ms	-3.4%	1.000	+0.00	~
`path_thin_gradient_medium_jpeg.jpeg@medium`	159.5ms	164.5ms	+3.2%	1.000	+0.00	~
`path_thin_gradient_small_jpeg.jpeg@high`	17.1ms	16.6ms	-3.0%	1.000	+0.00	~
`photo_perlin_medium_bmp.bmp@medium`	73.4ms	71.3ms	-2.9%	1.000	+0.00	~
`path_thin_gradient_small_jpeg.jpeg@low`	17.3ms	16.8ms	-2.9%	1.000	+0.00	~
`animated_redraw_tiny_gif.gif@medium`	5.1ms	5.2ms	+2.8%	1.000	+0.00	~
`graphic_geometric_large_jpeg.jpeg@high`	1165.6ms	1132.8ms	-2.8%	1.000	+0.00	~
`photo_perlin_large_heic.heic@medium`	3247.3ms	3156.9ms	-2.8%	1.000	+0.00	~
`vector_geometric_tiny_svg.svg@medium`	15.4ms	15.0ms	-2.7%	1.000	+0.00	~
`transparent_overlay_small_png.png@low`	26.2ms	25.5ms	-2.7%	1.000	+0.00	~
`text_screenshot_small_png.png@high`	49.5ms	48.1ms	-2.7%	1.000	+0.00	~
`animated_translation_tiny_apng.apng@medium`	10.3ms	10.0ms	-2.5%	1.000	+0.00	~
`fat_bmp_noise_xlarge.bmp@medium`	5887.7ms	5742.0ms	-2.5%	1.000	+0.00	~
`photo_noise_small_webp.webp@medium`	9.9ms	9.6ms	-2.4%	1.000	+0.00	~
`photo_perlin_small_png.png@high`	83.1ms	81.1ms	-2.4%	1.000	+0.00	~
`vector_with_script_medium_svg.svg@high`	232.8ms	227.3ms	-2.4%	1.000	+0.00	~
`graphic_geometric_medium_png.png@low`	2932.9ms	2863.4ms	-2.4%	1.000	+0.00	~
`photo_noise_small_png.png@high`	42.1ms	43.1ms	+2.4%	1.000	+0.00	~
`path_text_on_flat_medium_jpeg.jpeg@high`	51.2ms	50.0ms	-2.3%	1.000	+0.00	~
`path_thin_gradient_small_jpeg.jpeg@medium`	17.2ms	16.8ms	-2.3%	1.000	+0.00	~
`transparent_overlay_large_webp.webp@low`	983.1ms	960.2ms	-2.3%	1.000	+0.00	~
`animated_translation_large_apng.apng@low`	758.4ms	740.8ms	-2.3%	1.000	+0.00	~
`vector_geometric_tiny_svgz.svgz@low`	22.4ms	22.9ms	+2.3%	1.000	+0.00	~
`photo_perlin_tiny_avif.avif@medium`	4.2ms	4.3ms	+2.3%	1.000	+0.00	~
`photo_perlin_tiny_bmp.bmp@high`	6.9ms	7.1ms	+2.3%	1.000	+0.00	~
`vector_geometric_medium_svg.svg@low`	232.8ms	227.7ms	-2.2%	1.000	+0.00	~
`photo_noise_xlarge_jpeg.jpeg@high`	1382.6ms	1411.9ms	+2.1%	1.000	+0.00	~
`path_thin_gradient_large_jpeg.jpeg@high`	1370.3ms	1341.8ms	-2.1%	1.000	+0.00	~
`photo_perlin_tiny_bmp.bmp@medium`	6.4ms	6.5ms	+2.0%	1.000	+0.00	~
`graphic_geometric_medium_jpeg.jpeg@medium`	66.2ms	67.6ms	+2.0%	1.000	+0.00	~
`photo_perlin_tiny_png.png@medium`	22.6ms	23.0ms	+2.0%	1.000	+0.00	~
`photo_noise_small_png.png@medium`	51.8ms	52.8ms	+2.0%	1.000	+0.00	~
`animated_translation_tiny_apng.apng@low`	5.2ms	5.1ms	-2.0%	1.000	+0.00	~
`deep_color_10bit_small_avif.avif@medium`	0.0ms	0.0ms	+2.0%	1.000	+0.00	~
`photo_perlin_medium_heic.heic@low`	270.8ms	265.8ms	-1.8%	1.000	+0.00	~
`path_text_on_flat_small_jpeg.jpeg@high`	21.4ms	21.0ms	-1.8%	1.000	+0.00	~
`text_screenshot_large_jpeg.jpeg@low`	444.3ms	452.3ms	+1.8%	1.000	+0.00	~
`vector_geometric_large_canvas_svg.svg@low`	231.5ms	227.3ms	-1.8%	1.000	+0.00	~
`photo_perlin_large_png.png@medium`	2928.9ms	2876.5ms	-1.8%	1.000	+0.00	~
`text_screenshot_small_png.png@medium`	48.8ms	47.9ms	-1.8%	1.000	+0.00	~
`path_text_on_flat_small_jpeg.jpeg@low`	21.2ms	20.9ms	-1.8%	1.000	+0.00	~
`vector_with_script_medium_svg.svg@low`	229.4ms	225.3ms	-1.7%	1.000	+0.00	~
`text_screenshot_small_png.png@low`	62.5ms	61.4ms	-1.7%	1.000	+0.00	~
`photo_perlin_large_heic.heic@low`	2223.0ms	2185.0ms	-1.7%	1.000	+0.00	~
`graphic_geometric_tiny_jpeg.jpeg@medium`	2.3ms	2.3ms	-1.6%	1.000	+0.00	~
`vector_geometric_large_canvas_svg.svg@medium`	230.2ms	226.4ms	-1.6%	1.000	+0.00	~
`vector_with_script_small_svgz.svgz@low`	240.3ms	236.5ms	-1.6%	1.000	+0.00	~
`photo_perlin_medium_png.png@high`	535.3ms	527.1ms	-1.5%	1.000	+0.00	~
`graphic_geometric_small_webp.webp@high`	41.2ms	40.6ms	-1.5%	1.000	+0.00	~
`photo_perlin_medium_bmp.bmp@high`	81.6ms	80.5ms	-1.4%	1.000	+0.00	~
`animated_redraw_small_gif.gif@low`	9.1ms	9.0ms	-1.4%	1.000	+0.00	~
`transparent_overlay_large_webp.webp@high`	1326.8ms	1345.5ms	+1.4%	1.000	+0.00	~
`graphic_geometric_medium_jpeg.jpeg@high`	67.3ms	66.4ms	-1.4%	1.000	+0.00	~
`photo_noise_xlarge_png.png@high`	1823.9ms	1798.4ms	-1.4%	1.000	+0.00	~
`path_text_on_flat_medium_jpeg.jpeg@medium`	51.1ms	50.4ms	-1.4%	1.000	+0.00	~
`animated_redraw_small_gif.gif@medium`	38.4ms	37.9ms	-1.3%	1.000	+0.00	~
`photo_perlin_small_png.png@low`	88.5ms	87.3ms	-1.3%	1.000	+0.00	~
`photo_perlin_small_jpeg.jpeg@medium`	34.4ms	33.9ms	-1.3%	1.000	+0.00	~
`photo_perlin_large_heic.heic@high`	1922.9ms	1898.6ms	-1.3%	1.000	+0.00	~
`path_text_on_flat_medium_png.png@low`	1700.4ms	1678.9ms	-1.3%	1.000	+0.00	~
`photo_perlin_small_bmp.bmp@high`	8.9ms	8.7ms	-1.2%	1.000	+0.00	~
`path_thin_gradient_medium_jpeg.jpeg@low`	162.0ms	164.0ms	+1.2%	1.000	+0.00	~
`transparent_overlay_medium_webp.webp@high`	185.7ms	183.4ms	-1.2%	1.000	+0.00	~
`graphic_geometric_medium_jpeg.jpeg@low`	66.7ms	65.9ms	-1.2%	1.000	+0.00	~
`graphic_palette_medium_gif.gif@low`	21.3ms	21.0ms	-1.2%	1.000	+0.00	~
`graphic_palette_tiny_tiff.tiff@medium`	3.2ms	3.3ms	+1.2%	1.000	+0.00	~
`graphic_geometric_small_webp.webp@medium`	40.8ms	41.3ms	+1.2%	1.000	+0.00	~
`vector_geometric_medium_svg.svg@high`	232.7ms	229.9ms	-1.2%	1.000	+0.00	~
`path_text_on_flat_large_jpeg.jpeg@high`	610.4ms	603.2ms	-1.2%	1.000	+0.00	~
`graphic_geometric_large_jpeg.jpeg@medium`	1146.1ms	1132.6ms	-1.2%	1.000	+0.00	~
`photo_noise_large_webp.webp@high`	844.2ms	834.4ms	-1.2%	1.000	+0.00	~
`photo_perlin_xlarge_tiff.tiff@high`	450.6ms	445.5ms	-1.1%	1.000	+0.00	~
`fat_png_noise_xlarge.png@low`	3975.2ms	3930.7ms	-1.1%	1.000	+0.00	~
`deep_color_10bit_small_heic.heic@low`	22.8ms	23.0ms	+1.1%	1.000	+0.00	~
`transparent_overlay_small_png.png@medium`	24.1ms	24.3ms	+1.0%	1.000	+0.00	~
`graphic_geometric_large_jpeg.jpeg@low`	1142.9ms	1131.1ms	-1.0%	1.000	+0.00	~
`photo_perlin_tiny_jpeg.jpeg@low`	2.4ms	2.4ms	+1.0%	1.000	+0.00	~
`photo_perlin_xlarge_heic.heic@high`	8080.2ms	7999.4ms	-1.0%	1.000	+0.00	~
`photo_perlin_medium_heic.heic@medium`	424.2ms	420.0ms	-1.0%	1.000	+0.00	~
`graphic_geometric_small_webp.webp@low`	40.8ms	40.4ms	-1.0%	1.000	+0.00	~
`photo_perlin_medium_tiff.tiff@low`	26.6ms	26.8ms	+1.0%	1.000	+0.00	~
`vector_geometric_tiny_svgz.svgz@medium`	22.5ms	22.3ms	-1.0%	1.000	+0.00	~
`text_screenshot_large_png.png@high`	7827.6ms	7753.3ms	-0.9%	1.000	+0.00	~
`path_text_on_flat_large_png.png@low`	6652.1ms	6590.1ms	-0.9%	1.000	+0.00	~
`photo_perlin_tiny_webp.webp@medium`	1.2ms	1.2ms	+0.9%	1.000	+0.00	~
`photo_perlin_small_jpeg.jpeg@low`	34.4ms	34.1ms	-0.9%	1.000	+0.00	~
`vector_with_script_small_svg.svg@low`	22.4ms	22.2ms	-0.9%	1.000	+0.00	~
`photo_noise_xlarge_jpeg.jpeg@medium`	1374.5ms	1387.1ms	+0.9%	1.000	+0.00	~
`photo_perlin_large_tiff.tiff@high`	186.0ms	184.3ms	-0.9%	1.000	+0.00	~
`transparent_overlay_medium_png.png@high`	207.1ms	205.2ms	-0.9%	1.000	+0.00	~
`photo_noise_small_avif.avif@medium`	45.5ms	45.9ms	+0.9%	1.000	+0.00	~
`vector_with_script_small_svgz.svgz@medium`	239.9ms	237.9ms	-0.9%	1.000	+0.00	~
`text_screenshot_medium_jpeg.jpeg@medium`	48.7ms	48.3ms	-0.8%	1.000	+0.00	~
`graphic_geometric_medium_avif.avif@high`	2744.4ms	2721.3ms	-0.8%	1.000	+0.00	~
`graphic_palette_medium_gif.gif@high`	59.3ms	58.8ms	-0.8%	1.000	+0.00	~
`photo_noise_large_webp.webp@medium`	874.7ms	867.4ms	-0.8%	1.000	+0.00	~
`path_thin_gradient_large_png.png@high`	16990.3ms	16849.7ms	-0.8%	1.000	+0.00	~
`text_screenshot_large_png.png@low`	7946.2ms	7880.8ms	-0.8%	1.000	+0.00	~
`text_screenshot_medium_png.png@medium`	650.5ms	655.8ms	+0.8%	1.000	+0.00	~
`vector_geometric_small_svg.svg@high`	22.3ms	22.1ms	-0.8%	1.000	+0.00	~
`vector_with_script_small_svg.svg@high`	22.3ms	22.2ms	-0.8%	1.000	+0.00	~
`animated_redraw_small_gif.gif@high`	39.1ms	38.8ms	-0.8%	1.000	+0.00	~
`graphic_palette_medium_gif.gif@medium`	59.0ms	58.5ms	-0.8%	1.000	+0.00	~
`vector_geometric_tiny_svgz.svgz@high`	22.6ms	22.5ms	-0.8%	1.000	+0.00	~
`photo_perlin_xlarge_heic.heic@medium`	13946.4ms	13833.9ms	-0.8%	1.000	+0.00	~
`photo_perlin_small_jpeg.jpeg@high`	34.0ms	33.7ms	-0.8%	1.000	+0.00	~
`photo_noise_xlarge_png.png@low`	1766.4ms	1752.6ms	-0.8%	1.000	+0.00	~
`photo_noise_xlarge_webp.webp@high`	1411.0ms	1400.1ms	-0.8%	1.000	+0.00	~
`photo_noise_small_png.png@low`	51.7ms	52.1ms	+0.8%	1.000	+0.00	~
`transparent_overlay_medium_png.png@medium`	168.7ms	167.4ms	-0.8%	1.000	+0.00	~
`fat_png_noise_xlarge.png@medium`	4010.2ms	3979.6ms	-0.8%	1.000	+0.00	~
`photo_noise_medium_webp.webp@medium`	155.8ms	157.0ms	+0.8%	1.000	+0.00	~
`transparent_overlay_medium_webp.webp@medium`	124.8ms	123.9ms	-0.7%	1.000	+0.00	~
`photo_perlin_small_avif.avif@high`	58.1ms	58.5ms	+0.7%	1.000	+0.00	~
`path_thin_gradient_large_jpeg.jpeg@medium`	1345.2ms	1354.5ms	+0.7%	1.000	+0.00	~
`animated_translation_xlarge_apng.apng@low`	5644.3ms	5605.8ms	-0.7%	1.000	+0.00	~
`photo_noise_large_avif.avif@high`	1328.3ms	1319.3ms	-0.7%	1.000	+0.00	~
`photo_noise_xlarge_avif.avif@high`	3532.2ms	3508.6ms	-0.7%	1.000	+0.00	~
`vector_geometric_small_svg.svg@medium`	22.2ms	22.0ms	-0.7%	1.000	+0.00	~
`text_screenshot_large_jpeg.jpeg@high`	445.1ms	442.1ms	-0.7%	1.000	+0.00	~
`photo_perlin_small_webp.webp@high`	54.5ms	54.1ms	-0.7%	1.000	+0.00	~
`path_thin_gradient_large_png.png@low`	21900.6ms	21759.0ms	-0.6%	1.000	+0.00	~
`photo_noise_small_heic.heic@high`	68.4ms	67.9ms	-0.6%	1.000	+0.00	~
`graphic_palette_small_png.png@low`	141.6ms	140.7ms	-0.6%	1.000	+0.00	~
`deep_color_10bit_small_avif.avif@high`	45.7ms	46.0ms	+0.6%	1.000	+0.00	~
`photo_perlin_medium_jpeg.jpeg@high`	143.9ms	144.8ms	+0.6%	1.000	+0.00	~
`photo_noise_small_webp.webp@high`	9.3ms	9.3ms	+0.6%	1.000	+0.00	~
`text_screenshot_medium_png.png@low`	656.1ms	652.0ms	-0.6%	1.000	+0.00	~
`photo_perlin_tiny_png.png@low`	17.8ms	17.7ms	-0.6%	1.000	+0.00	~
`vector_geometric_large_canvas_svg.svg@high`	236.3ms	237.7ms	+0.6%	1.000	+0.00	~
`animated_redraw_xlarge_gif.gif@low`	729.8ms	725.4ms	-0.6%	1.000	+0.00	~
`transparent_overlay_small_png.png@high`	28.2ms	28.3ms	+0.6%	1.000	+0.00	~
`photo_perlin_large_bmp.bmp@high`	260.1ms	261.7ms	+0.6%	1.000	+0.00	~
`photo_noise_small_webp.webp@low`	10.0ms	10.0ms	+0.6%	1.000	+0.00	~
`animated_redraw_large_gif.gif@high`	857.6ms	852.9ms	-0.5%	1.000	+0.00	~
`text_screenshot_medium_jpeg.jpeg@low`	48.1ms	48.4ms	+0.5%	1.000	+0.00	~
`path_thin_gradient_medium_png.png@high`	11380.7ms	11322.0ms	-0.5%	1.000	+0.00	~
`vector_geometric_tiny_svg.svg@high`	15.5ms	15.6ms	+0.5%	1.000	+0.00	~
`photo_noise_xlarge_webp.webp@low`	1489.7ms	1497.1ms	+0.5%	1.000	+0.00	~
`vector_with_script_small_svgz.svgz@high`	242.4ms	241.2ms	-0.5%	1.000	+0.00	~
`graphic_palette_small_png.png@medium`	227.7ms	226.6ms	-0.5%	1.000	+0.00	~
`path_text_on_flat_large_png.png@medium`	6641.1ms	6609.0ms	-0.5%	1.000	+0.00	~
`photo_perlin_large_png.png@high`	3216.8ms	3201.3ms	-0.5%	1.000	+0.00	~
`path_thin_gradient_medium_png.png@low`	15304.8ms	15232.3ms	-0.5%	1.000	+0.00	~
`graphic_geometric_tiny_png.png@low`	4.0ms	4.0ms	+0.5%	1.000	+0.00	~
`graphic_geometric_medium_png.png@medium`	5759.5ms	5734.3ms	-0.4%	1.000	+0.00	~
`photo_perlin_small_png.png@medium`	87.7ms	87.3ms	-0.4%	1.000	+0.00	~
`fat_tiff_perlin_xlarge.tiff@low`	1766.2ms	1758.8ms	-0.4%	1.000	+0.00	~
`animated_translation_xlarge_apng.apng@medium`	5636.2ms	5613.1ms	-0.4%	1.000	+0.00	~
`animated_redraw_large_gif.gif@medium`	866.2ms	869.7ms	+0.4%	1.000	+0.00	~
`transparent_overlay_medium_webp.webp@low`	124.5ms	125.0ms	+0.4%	1.000	+0.00	~
`text_screenshot_large_png.png@medium`	7942.5ms	7911.3ms	-0.4%	1.000	+0.00	~
`path_text_on_flat_large_jpeg.jpeg@medium`	604.0ms	601.6ms	-0.4%	1.000	+0.00	~
`photo_perlin_large_tiff.tiff@low`	179.5ms	178.8ms	-0.4%	1.000	+0.00	~
`animated_redraw_large_gif.gif@low`	143.5ms	143.0ms	-0.4%	1.000	+0.00	~
`photo_noise_large_jpeg.jpeg@medium`	372.6ms	371.2ms	-0.4%	1.000	+0.00	~
`photo_perlin_medium_jpeg.jpeg@medium`	144.2ms	144.7ms	+0.4%	1.000	+0.00	~
`fat_avif_noise_xlarge.avif@high`	9660.2ms	9626.4ms	-0.4%	1.000	+0.00	~
`animated_translation_xlarge_apng.apng@high`	5545.5ms	5564.9ms	+0.3%	1.000	+0.00	~
`path_text_on_flat_medium_webp.webp@high`	76.2ms	75.9ms	-0.3%	1.000	+0.00	~
`photo_perlin_small_webp.webp@low`	60.6ms	60.4ms	-0.3%	1.000	+0.00	~
`vector_geometric_small_svg.svg@low`	22.2ms	22.2ms	+0.3%	1.000	+0.00	~
`photo_perlin_large_png.png@low`	2881.7ms	2891.1ms	+0.3%	1.000	+0.00	~
`photo_noise_medium_webp.webp@low`	160.3ms	159.8ms	-0.3%	1.000	+0.00	~
`photo_perlin_medium_jpeg.jpeg@low`	145.3ms	144.9ms	-0.3%	1.000	+0.00	~
`path_text_on_flat_medium_png.png@medium`	1689.5ms	1684.1ms	-0.3%	1.000	+0.00	~
`text_screenshot_medium_png.png@high`	655.3ms	653.2ms	-0.3%	1.000	+0.00	~
`photo_noise_xlarge_webp.webp@medium`	1464.6ms	1469.0ms	+0.3%	1.000	+0.00	~
`path_thin_gradient_large_jpeg.jpeg@low`	1356.2ms	1352.3ms	-0.3%	1.000	+0.00	~
`path_thin_gradient_medium_png.png@medium`	16224.6ms	16271.2ms	+0.3%	1.000	+0.00	~
`path_thin_gradient_medium_jpeg.jpeg@high`	164.3ms	164.8ms	+0.3%	1.000	+0.00	~
`photo_noise_large_jpeg.jpeg@low`	371.4ms	370.3ms	-0.3%	1.000	+0.00	~
`photo_perlin_small_webp.webp@medium`	57.3ms	57.2ms	-0.3%	1.000	+0.00	~
`photo_perlin_xlarge_heic.heic@low`	9449.1ms	9423.5ms	-0.3%	1.000	+0.00	~
`animated_translation_large_apng.apng@medium`	745.9ms	744.0ms	-0.3%	1.000	+0.00	~
`photo_noise_medium_avif.avif@medium`	123.0ms	123.3ms	+0.2%	1.000	+0.00	~
`photo_noise_large_avif.avif@medium`	1687.4ms	1683.3ms	-0.2%	1.000	+0.00	~
`vector_with_script_small_svg.svg@medium`	22.1ms	22.1ms	+0.2%	1.000	+0.00	~
`photo_noise_large_webp.webp@low`	905.1ms	907.1ms	+0.2%	1.000	+0.00	~
`text_screenshot_medium_jpeg.jpeg@high`	48.0ms	47.9ms	-0.2%	1.000	+0.00	~
`transparent_overlay_large_png.png@medium`	1189.9ms	1187.4ms	-0.2%	1.000	+0.00	~
`fat_tiff_perlin_xlarge.tiff@high`	1804.6ms	1808.1ms	+0.2%	1.000	+0.00	~
`path_text_on_flat_medium_webp.webp@medium`	77.8ms	77.9ms	+0.2%	1.000	+0.00	~
`photo_perlin_medium_png.png@medium`	519.7ms	520.6ms	+0.2%	1.000	+0.00	~
`photo_perlin_xlarge_tiff.tiff@medium`	447.5ms	448.2ms	+0.2%	1.000	+0.00	~
`photo_noise_xlarge_png.png@medium`	1748.5ms	1746.0ms	-0.1%	1.000	+0.00	~
`path_thin_gradient_large_png.png@medium`	24237.0ms	24270.9ms	+0.1%	1.000	+0.00	~
`path_text_on_flat_large_webp.webp@medium`	4015.0ms	4020.6ms	+0.1%	1.000	+0.00	~
`vector_geometric_medium_svg.svg@medium`	226.7ms	226.4ms	-0.1%	1.000	+0.00	~
`animated_redraw_xlarge_gif.gif@medium`	4519.1ms	4513.1ms	-0.1%	1.000	+0.00	~
`graphic_geometric_medium_png.png@high`	5749.5ms	5756.9ms	+0.1%	1.000	+0.00	~
`path_text_on_flat_small_jpeg.jpeg@medium`	21.0ms	21.0ms	-0.1%	1.000	+0.00	~
`graphic_palette_small_png.png@high`	242.5ms	242.2ms	-0.1%	1.000	+0.00	~
`photo_perlin_large_tiff.tiff@medium`	185.0ms	184.7ms	-0.1%	1.000	+0.00	~
`text_screenshot_large_jpeg.jpeg@medium`	444.3ms	444.8ms	+0.1%	1.000	+0.00	~
`path_text_on_flat_large_png.png@high`	6653.5ms	6646.5ms	-0.1%	1.000	+0.00	~
`path_thin_gradient_medium_webp.webp@high`	3501.4ms	3505.0ms	+0.1%	1.000	+0.00	~
`photo_perlin_medium_png.png@low`	522.4ms	521.9ms	-0.1%	1.000	+0.00	~
`transparent_overlay_large_png.png@high`	1660.8ms	1659.2ms	-0.1%	1.000	+0.00	~
`graphic_geometric_tiny_avif.avif@medium`	5.6ms	5.6ms	+0.1%	1.000	+0.00	~
`transparent_overlay_large_png.png@low`	906.6ms	907.4ms	+0.1%	1.000	+0.00	~
`fat_avif_noise_xlarge.avif@medium`	12867.7ms	12855.9ms	-0.1%	1.000	+0.00	~
`vector_with_script_medium_svg.svg@medium`	226.2ms	226.4ms	+0.1%	1.000	+0.00	~
`path_text_on_flat_medium_png.png@high`	1647.7ms	1646.4ms	-0.1%	1.000	+0.00	~
`photo_noise_xlarge_avif.avif@medium`	4457.2ms	4454.0ms	-0.1%	1.000	+0.00	~
`photo_perlin_small_tiff.tiff@medium`	4.1ms	4.1ms	+0.1%	1.000	+0.00	~
`path_text_on_flat_medium_webp.webp@low`	79.1ms	79.0ms	-0.1%	1.000	+0.00	~
`animated_translation_large_apng.apng@high`	745.2ms	744.7ms	-0.1%	1.000	+0.00	~
`fat_png_noise_xlarge.png@high`	3572.0ms	3573.7ms	+0.0%	1.000	+0.00	~
`photo_noise_xlarge_jpeg.jpeg@low`	1381.3ms	1381.9ms	+0.0%	1.000	+0.00	~
`path_text_on_flat_large_jpeg.jpeg@low`	605.6ms	605.3ms	-0.0%	1.000	+0.00	~
`photo_perlin_small_tiff.tiff@high`	4.2ms	4.2ms	-0.0%	1.000	+0.00	~
`animated_redraw_xlarge_gif.gif@high`	4470.7ms	4471.8ms	+0.0%	1.000	+0.00	~
`fat_tiff_perlin_xlarge.tiff@medium`	1810.7ms	1811.0ms	+0.0%	1.000	+0.00	~
`deep_color_10bit_small_heic.heic@high`	31.4ms	31.4ms	+0.0%	1.000	+0.00	~

Compression

reduction_threshold=3.0pp, size_threshold=5.0%

Format	Cases	Median Δreduction (pp)	Worst Δpp	Status
`apng`	15	+0.0	+0.0	~
`avif`	30	+0.0	+0.0	~
`bmp`	18	+0.0	+0.0	~
`gif`	15	+0.0	+0.0	~
`heic`	21	+0.0	+0.0	~
`jpeg`	48	+0.0	+0.0	~
`png`	60	+0.0	+0.0	~
`svg`	18	+0.0	+0.0	~
`svgz`	6	+0.0	+0.0	~
`tiff`	18	+0.0	+0.0	~
`webp`	36	+0.0	+0.0	~

Estimation

estimation_threshold=10.0pp

Format×Path	Cases	Median Δerror (pp)	Worst Δpp	Status
apng × exact	15	+0.0	+0.0	~
avif × direct_encode_sample	9	+0.0	+0.0	~
avif × exact	21	+0.0	+0.0	~
bmp × exact	9	+0.0	+0.0	~
bmp × generic_fallback_sample	9	+0.0	+0.0	~
gif × exact	15	+0.0	+0.0	~
heic × direct_encode_sample	6	+0.0	+0.0	~
heic × exact	15	+0.0	+0.0	~
jpeg × direct_encode_sample	18	+0.0	+0.0	~
jpeg × exact	30	+0.0	+0.0	~
png × direct_encode_sample	21	+0.0	+0.0	~
png × exact	39	+0.0	+0.0	~
svg × exact	18	+0.0	+0.0	~
svgz × exact	6	+0.0	+0.0	~
tiff × direct_encode_sample	9	+0.0	+0.0	~
tiff × exact	9	+0.0	+0.0	~
webp × direct_encode_sample	12	+0.0	+0.0	~
webp × exact	24	+0.0	+0.0	~

Auto-posted by .github/workflows/bench-pr.yml. Re-runs on every push; this comment updates in place.

Two compare-engine refinements driven by the first bench-pr.yml run on this PR (#34): the timing gate fired on noise but not on any real regression, so tune the gate.

## Fix A: _STATS_MIN_ITERS 3 → 2

Welch's t-test at n=2 per side has only 1 to 2 effective degrees of freedom (wide CI, statistically weak). But it correctly returns p>0.05 for noisy cases (no false positive) and clears tight ones. That is better than the blunt 25% threshold in both directions, and it matches accuracy mode's natural rhythm (the PR ships --repeat 2 for that mode given the timeout budget).

## Fix C: noise-floor gate skips cases below absolute-ms floor

Cases with a baseline median below 5 ms (configurable via the new --noise-floor-min-ms flag) are not flagged by the relative gate. A 0.04 ms → 0.07 ms case is measurement quantization, not signal. The AVIF "already-optimized" early-skip path produced these at high volume in the first bench-pr run on #34 (9 of 13 flagged cases were sub-0.1 ms). The floor is surfaced in the markdown header alongside threshold-pct and noise-floor-pct so PR comment readers see the gate definition.

3 new tests:
- test_noise_floor_skipped_below_min_ms: 0.5→0.7ms case (+40%) doesn't flag
- test_noise_floor_fires_at_or_above_min_ms: 100→140ms case (+40%) still flags
- test_stats_engages_at_n_equals_2: 2 iters each side routes through stats path

## Empirical impact on PR #34's own bench run

Re-running compare against the artifact from run 26044094317:
- Before: 13 noise_floor_flags (9 sub-0.1ms AVIF + 4 BMP CPU-steal)
- After: 3 noise_floor_flags (sub-ms quantization gone; CPU-steal remains)

The 3 surviving flags are real shared-CI noise on single-threaded multi-100ms BMP cases. That is a separate axis (per-format noise floor or CPU-steal isolation) and out of scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
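The two gates described in this commit message can be sketched roughly as follows. The constant values (2 iterations, 25%, 5 ms) come from the PR text; the function names, signatures, and the choice to flag only slowdowns are assumptions for illustration and may differ from the real compare engine.

```python
STATS_MIN_ITERS = 2        # mirrors _STATS_MIN_ITERS after Fix A (was 3)
NOISE_FLOOR_PCT = 25.0     # the 25% relative noise floor mentioned in the PR
NOISE_FLOOR_MIN_MS = 5.0   # default of the new --noise-floor-min-ms flag


def choose_path(n_baseline, n_head, min_iters=STATS_MIN_ITERS):
    """Pick the comparison path from the iteration counts on each side.

    Hypothetical helper: both sides need at least `min_iters` samples
    before a t-test is attempted; otherwise fall back to the noise floor.
    """
    return "stats" if min(n_baseline, n_head) >= min_iters else "noise_floor"


def noise_floor_flag(baseline_ms, head_ms,
                     pct=NOISE_FLOOR_PCT, min_ms=NOISE_FLOOR_MIN_MS):
    """Return True when the relative gate should flag a slowdown."""
    if baseline_ms < min_ms:
        # Fix C: below the absolute floor, a large relative change is
        # treated as timer quantization rather than signal.
        return False
    delta_pct = (head_ms - baseline_ms) / baseline_ms * 100.0
    # Assumption: only slowdowns (positive deltas) are flagged.
    return delta_pct > pct
```

With these defaults, the 0.5 ms → 0.7 ms case from the new tests is skipped while the 100 ms → 140 ms case still flags, and a 1-vs-2 iteration mismatch falls back to the noise-floor path.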

github-actions · 2026-05-18T16:23:03Z

✅ Test & Coverage Report

Status: PASS | Tests: 1306 passed, 0 failed, 0 errors (1306 total)

🟢 Overall Coverage: 95.0%

Module	Statements	Covered	Missing	Coverage
`config.py`	50	50	0	100.0%
`estimation`	1252	1144	108	91.4%
`exceptions.py`	40	40	0	100.0%
`optimizers`	866	830	36	95.8%
`routers`	226	220	6	97.3%
`schemas.py`	68	68	0	100.0%
`security`	166	165	1	99.4%
`storage`	31	31	0	100.0%
`utils`	409	404	5	98.8%

Report generated at 2026-05-18 16:23:01 UTC from commit 334ff45
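As a quick sanity check on the table above, the per-module and overall percentages can be recomputed from the statement counts. The sketch below assumes the overall figure is the statement-weighted ratio (total covered over total statements), which reproduces the reported 95.0%; the report generator's actual formula is not shown in the PR.

```python
# (statements, covered) per module, copied from the coverage table.
MODULES = {
    "config.py": (50, 50),
    "estimation": (1252, 1144),
    "exceptions.py": (40, 40),
    "optimizers": (866, 830),
    "routers": (226, 220),
    "schemas.py": (68, 68),
    "security": (166, 165),
    "storage": (31, 31),
    "utils": (409, 404),
}


def coverage_pct(covered, statements):
    """Coverage as a percentage rounded to one decimal place."""
    return round(100.0 * covered / statements, 1)


total_statements = sum(s for s, _ in MODULES.values())
total_covered = sum(c for _, c in MODULES.values())
overall = coverage_pct(total_covered, total_statements)
```

The weighted ratio is dominated by `estimation`, the largest module, which is why the overall figure sits well below a plain average of the nine per-module percentages.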

…bench-baseline]

The previous baseline.core.json was generated at --repeat 1 (pre-PR #34's accuracy-mode --repeat support). After #34 merged, the auto bench-baseline-update workflow ran the candidate at --repeat 2 but couldn't promote it, because the n=1 vs n=2 mismatch routes the compare to the noise-floor path, where single-threaded BMP cases on shared CI tripped 2 false-positive flags (opened as drift issue #35).

Adopt the candidate from CI run 26048342324 directly as the pinned baseline so future comparisons are n=2 vs n=2 and engage Welch's t-test properly, escaping the bootstrap cycle.

[skip bench-baseline] guards against the workflow firing on this commit and re-opening another drift issue.

Closes #35.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
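To see why an n=1 baseline cannot feed a Welch test, here is a minimal standard-library sketch of the Welch t statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the `None` return for unusable inputs are assumptions for illustration, not the compare engine's API, and no p-value is computed here.

```python
import math
from statistics import mean, variance


def welch_t(baseline, head):
    """Return (t, df) for Welch's unequal-variance t-test, or None.

    None means the test cannot be computed: a side has fewer than two
    samples (sample variance is undefined) or both variances are zero.
    """
    n1, n2 = len(baseline), len(head)
    if n1 < 2 or n2 < 2:
        return None
    # Squared standard error contributed by each side.
    v1 = variance(baseline) / n1
    v2 = variance(head) / n2
    se2 = v1 + v2
    if se2 == 0.0:
        return None
    t = (mean(head) - mean(baseline)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

With two timings per side the degrees of freedom land between 1 and 2, so the test is weak but defined; with a single baseline timing there is no variance estimate at all, which is consistent with the fallback to the noise-floor path described above.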

Copilot AI review requested due to automatic review settings May 18, 2026 15:44

Copilot started reviewing on behalf of amitray007 May 18, 2026 15:45

amitray007 self-assigned this May 18, 2026

Copilot AI reviewed May 18, 2026

Comment thread bench/runner/cli.py

Comment on lines +149 to +154

    config = {
        "warmup": args.warmup,
        "repeat": args.repeat,
        "stages": ["estimate", "optimize"],
    }
    iterations = run_accuracy_sync(cases, repeat=args.repeat, warmup=args.warmup)
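For context, the warmup-then-repeat loop that these parameters drive might look roughly like the sketch below. It is synchronous and takes the per-case runner as a `run_one` argument purely for illustration; the real `run_accuracy` in bench/runner/modes/accuracy.py is not shown in this thread, and its signature, return shape, and the iteration index used during warmup are assumptions here.

```python
def run_accuracy(cases, run_one, repeat=1, warmup=0):
    """Run every case `warmup` times unrecorded, then `repeat` times recorded.

    run_one(case, iteration=i) stands in for _run_one_accuracy_case;
    it is a hypothetical callable injected for this sketch.
    """
    # Warmup passes: exercise caches and lazy imports, discard the results.
    for _ in range(warmup):
        for case in cases:
            run_one(case, iteration=0)  # assumption: warmup index is arbitrary

    # Measured passes: one list of results per iteration index.
    iterations = []
    for i in range(repeat):
        iterations.append([run_one(case, iteration=i) for case in cases])
    return iterations
```

Recording `warmup` and `repeat` in the config dict, as the diff above does, lets a later compare step tell how many samples per case each JSON file actually holds.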

amitray007 merged commit 6246632 into main May 18, 2026
2 of 3 checks passed

amitray007 deleted the bench-hardening-v2 branch May 18, 2026 17:06

amitray007 merged 2 commits into main from bench-hardening-v2
