
feat: CI benchmark regression detection on PRs #2013

Merged
ArthurZucker merged 32 commits into main from ci-benchmarks on Apr 10, 2026

@ArthurZucker
Collaborator

Summary

Adds automated benchmark regression detection to CI. Every PR gets a comment showing how it compares against main.

New benchmark suite (ci_benchmark.rs)

13 measurements covering all key performance surfaces:

Group          Benchmarks
bpe-gpt2       encode, encode-batch, encode-no-cache, encode-batch-no-cache
llama3         encode, encode-batch, encode-fast, encode-char-offsets, concurrent-4t
serialization  gpt2-load, llama3-load, llama3-save
train          bpe-small

Runs in ~4 minutes on CI.

GitHub Actions workflow (benchmarks.yml)

  • Push to main: runs benchmarks, stores baseline in gh-pages branch
  • Pull request: runs benchmarks, compares against baseline, posts/updates a single PR comment with the delta table
  • Alert threshold: 15% regression (warns, doesn't fail CI)
  • Uses benchmark-action/github-action-benchmark with criterion's bencher output format
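
As a rough illustration of the comparison step described above, here is a minimal Python sketch of a 15% regression check and delta table. It is not the logic of benchmark-action/github-action-benchmark; the `Delta`, `compare`, and `render_table` names and the table columns are hypothetical, and it assumes results are times per iteration in nanoseconds, where larger means slower.

```python
from dataclasses import dataclass

# 15% slowdown, the alert threshold stated in the PR description.
ALERT_THRESHOLD = 0.15


@dataclass
class Delta:
    name: str
    baseline_ns: float
    current_ns: float

    @property
    def change(self) -> float:
        """Relative change in time per iteration; positive means slower."""
        return (self.current_ns - self.baseline_ns) / self.baseline_ns

    @property
    def regressed(self) -> bool:
        """True when the slowdown exceeds the alert threshold."""
        return self.change > ALERT_THRESHOLD


def compare(baseline: dict[str, float], current: dict[str, float]) -> list[Delta]:
    """Pair up benchmarks present in both runs, sorted by name."""
    shared = sorted(baseline.keys() & current.keys())
    return [Delta(name, baseline[name], current[name]) for name in shared]


def render_table(deltas: list[Delta]) -> str:
    """Render a markdown delta table (hypothetical layout) for a PR comment."""
    lines = ["| Benchmark | main (ns) | PR (ns) | Change |", "|---|---|---|---|"]
    for d in deltas:
        flag = " (regression)" if d.regressed else ""
        lines.append(
            f"| {d.name} | {d.baseline_ns:.0f} | {d.current_ns:.0f} | {d.change:+.1%}{flag} |"
        )
    return "\n".join(lines)
```

Because the threshold only warns, a caller would surface the flagged rows in the comment rather than exit non-zero.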

Local numbers (Apple M-series, sample_size=10)

Benchmark                        Time      Throughput
bpe-gpt2/encode                  1.082 s   5.72 MiB/s
bpe-gpt2/encode-batch            762 ms    8.12 MiB/s
bpe-gpt2/encode-no-cache         1.292 s   4.79 MiB/s
bpe-gpt2/encode-batch-no-cache   244 ms    25.3 MiB/s
llama3/encode                    1.060 s   5.84 MiB/s
llama3/encode-batch              207 ms    29.9 MiB/s
llama3/encode-fast               1.067 s   5.80 MiB/s
llama3/encode-char-offsets       178 ms    34.8 MiB/s
llama3/concurrent-4t             21.6 ms   19.1 MiB/s
serialization/gpt2-load          24.1 ms
serialization/llama3-load        157 ms
serialization/llama3-save        35.1 ms   195.7 MiB/s
train/bpe-small                  26.0 ms   279 KiB/s
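
The throughput column is input size divided by time per iteration, so multiplying the two columns back together gives the implied input size. The sketch below does that arithmetic for three rows copied from the numbers above; the helper names are hypothetical. Each of the three comes out at roughly 6.19 MiB, which suggests these benchmarks share one input.

```python
MIB = 1024 * 1024


def throughput_mib_s(n_bytes: int, seconds: float) -> float:
    """Throughput in MiB/s for n_bytes processed in the given wall time."""
    return n_bytes / MIB / seconds


def implied_input_mib(seconds: float, mib_per_s: float) -> float:
    """Input size in MiB implied by a (time, throughput) pair."""
    return seconds * mib_per_s


# (time in seconds, throughput in MiB/s), copied from the local numbers.
rows = {
    "bpe-gpt2/encode": (1.082, 5.72),
    "bpe-gpt2/encode-batch": (0.762, 8.12),
    "llama3/encode": (1.060, 5.84),
}

for name, (secs, rate) in rows.items():
    print(f"{name}: ~{implied_input_mib(secs, rate):.2f} MiB per iteration")
```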

Setup needed

The gh-pages branch needs to be initialized once (can be empty) for the baseline storage to work. After the first push to main with this workflow, subsequent PRs will get comparison comments automatically.

Add a consolidated benchmark suite (`ci_benchmark`) and a GitHub Actions
workflow that automatically compares performance against the main branch
baseline on every PR.

Benchmark coverage (13 measurements, ~4 min on CI):
  - BPE GPT-2: encode, batch, no-cache, batch-no-cache
  - Llama-3: encode, batch, encode-fast, char-offsets, concurrent-4t
  - Serialization: GPT-2 load, Llama-3 load, Llama-3 save
  - Training: BPE small corpus

CI workflow:
  - On push to main: run benchmarks, store baseline in gh-pages branch
  - On PR: run benchmarks, compare vs baseline, post/update a single
    PR comment with the delta table
  - Alert threshold: 15% regression (warn, don't fail)

Uses benchmark-action/github-action-benchmark with criterion's bencher
output format for machine-readable results.
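
For illustration, bencher-style output can be parsed with a single regular expression. This is a sketch that assumes lines shaped like `test <name> ... bench: <value> ns/iter (+/- <spread>)`; the `parse_bencher` helper is hypothetical and is not the parser the action itself uses.

```python
import re

# One bencher-style result line, e.g.
#   test bpe-gpt2/encode ... bench:  1082000000 ns/iter (+/- 2500000)
# The value and spread may contain thousands separators.
BENCH_LINE = re.compile(
    r"^test (?P<name>\S+)\s+\.\.\. bench:\s+(?P<value>[0-9,.]+) "
    r"(?P<unit>\w+/\w+) \(\+/- (?P<spread>[0-9,.]+)\)$"
)


def parse_bencher(output: str) -> dict[str, float]:
    """Map benchmark name to its reported value, skipping unrelated lines."""
    results: dict[str, float] = {}
    for line in output.splitlines():
        match = BENCH_LINE.match(line.strip())
        if match is None:
            continue
        results[match["name"]] = float(match["value"].replace(",", ""))
    return results
```

A dictionary keyed by benchmark name makes it straightforward to line up a PR run against the stored baseline.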
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- pytest-benchmark based test suite covering:
  - BPE GPT-2: encode, encode_batch, multithreaded (4 workers)
  - Llama-3: encode, encode_batch, encode_fast, multithreaded, decode_batch
  - Async: async_encode_batch, async_encode_batch_fast
  - Serialization: from_file, to_str, from_str (roberta, llama3, albert)
  - Training: BPE small corpus
- CI workflow: separate benchmark-python job with sccache + maturin
- Re-enabled ci-benchmarks branch trigger for testing
- New benchmark-trigger.yml: maintainer comments '/benchmark' on a PR
  to dispatch the benchmark workflow on the PR's ref
- Upload steps gated on github.event_name == 'push' so workflow_dispatch
  (PR runs) never overwrite the baseline
- Trigger requires MEMBER/OWNER/COLLABORATOR association
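
The two gating rules above can be sketched as plain predicates. This is an illustrative model rather than the workflow's actual expressions; the function names are hypothetical, and it assumes the comment must be exactly `/benchmark` after trimming whitespace.

```python
# Author associations allowed to trigger a run, as listed in the commit message.
ALLOWED_ASSOCIATIONS = {"MEMBER", "OWNER", "COLLABORATOR"}


def should_dispatch(comment_body: str, author_association: str, on_pull_request: bool) -> bool:
    """True when a comment should dispatch the benchmark workflow.

    Assumption: the comment must be exactly '/benchmark' after trimming.
    """
    return (
        on_pull_request
        and comment_body.strip() == "/benchmark"
        and author_association in ALLOWED_ASSOCIATIONS
    )


def should_upload_baseline(event_name: str) -> bool:
    """Only pushes update the baseline; workflow_dispatch (PR) runs never do."""
    return event_name == "push"
```

Keeping the upload check separate from the dispatch check means a PR run can reuse the whole benchmark job without any risk of overwriting the baseline.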
@ArthurZucker
Collaborator Author

/benchmark

…en done

- benchmark-trigger.yml creates a 'Benchmark Results' check on the PR head SHA
- benchmarks.yml marks it in_progress at start, completed (success/failure) at end
- The check body contains the comparison markdown table
- Can be made a required check in branch protection rules
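
The check lifecycle described above maps onto the request bodies of the GitHub REST check-runs endpoints (create, then update). The sketch below only builds those bodies. The helper names are hypothetical, and the choice of `queued` as the initial status and of `summary` as the field carrying the table are assumptions.

```python
# Check name taken from the commit message above.
CHECK_NAME = "Benchmark Results"


def create_check_payload(head_sha: str) -> dict:
    """Body for creating the check on the PR head SHA (initial status assumed)."""
    return {"name": CHECK_NAME, "head_sha": head_sha, "status": "queued"}


def start_payload() -> dict:
    """Body for marking the check as running when the benchmarks start."""
    return {"status": "in_progress"}


def complete_payload(passed: bool, table_markdown: str) -> dict:
    """Body for finishing the check, with the comparison table in its output."""
    return {
        "status": "completed",
        "conclusion": "success" if passed else "failure",
        "output": {"title": CHECK_NAME, "summary": table_markdown},
    }
```

Because the check has a stable name, it can be selected by that name in branch protection rules.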
@ArthurZucker
Collaborator Author

ArthurZucker commented Apr 9, 2026

/benchmark

@ArthurZucker ArthurZucker requested a review from McPatate April 9, 2026 06:12
Rust:
- Push to main: runs with --save-baseline main, uploads criterion
  data (tar.gz) + bencher output to HF Hub
- workflow_dispatch: downloads criterion baseline, runs with --baseline main
  for automatic criterion comparison
- criterion HTML report uploaded as GitHub Actions artifact (30 day retention)

Python:
- bench_output.json uploaded as GitHub Actions artifact
- Baseline stored/compared via HF Hub as before

Both:
- Artifacts downloadable from the workflow run page for manual inspection
- Comparison tables posted to PR comments
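
A minimal sketch of how the Rust invocation might differ between the two events, using criterion's `--save-baseline` and `--baseline` flags named above. The `criterion_args` helper is hypothetical, and `ci_benchmark` as the bench target name is an assumption based on the file name `ci_benchmark.rs`.

```python
def criterion_args(event_name: str, baseline: str = "main") -> list[str]:
    """Build the cargo bench command line for a given workflow event.

    Assumption: the bench target is named after ci_benchmark.rs.
    """
    base = ["cargo", "bench", "--bench", "ci_benchmark", "--"]
    if event_name == "push":
        # Record a fresh baseline for later PR runs to compare against.
        return base + ["--save-baseline", baseline]
    if event_name == "workflow_dispatch":
        # Compare against the previously downloaded baseline without overwriting it.
        return base + ["--baseline", baseline]
    raise ValueError(f"unsupported event: {event_name}")
```

The resulting list can be passed directly to `subprocess.run` without shell quoting.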
@ArthurZucker ArthurZucker merged commit 5e35576 into main Apr 10, 2026
36 checks passed
@ArthurZucker ArthurZucker deleted the ci-benchmarks branch April 10, 2026 10:17