feat: CI benchmark regression detection on PRs by ArthurZucker · Pull Request #2013 · huggingface/tokenizers

ArthurZucker · 2026-04-08T13:54:48Z

Summary

Adds automated benchmark regression detection to CI. Every PR gets a comment showing how it compares against main.

New benchmark suite (`ci_benchmark.rs`)

13 measurements covering all key performance surfaces:

| Group | Benchmarks |
| --- | --- |
| bpe-gpt2 | encode, encode-batch, encode-no-cache, encode-batch-no-cache |
| llama3 | encode, encode-batch, encode-fast, encode-char-offsets, concurrent-4t |
| serialization | gpt2-load, llama3-load, llama3-save |
| train | bpe-small |

Runs in ~4 minutes on CI.

GitHub Actions workflow (`benchmarks.yml`)

- Push to main: runs benchmarks, stores baseline in gh-pages branch
- Pull request: runs benchmarks, compares against baseline, posts/updates a single PR comment with the delta table
- Alert threshold: 15% regression (warns, doesn't fail CI)
- Uses benchmark-action/github-action-benchmark with criterion's bencher output format
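
The comparison step can be pictured with a minimal sketch. This is not the code of benchmark-action/github-action-benchmark; it only illustrates the idea of reading bencher-format lines and flagging slowdowns above the 15% threshold. The function names and the sample input are hypothetical.

```python
import re

# Matches lines in the libtest "bencher" format, e.g.
#   test llama3/encode ... bench:   1,060,000 ns/iter (+/- 12,000)
BENCH_LINE = re.compile(
    r"^test\s+(?P<name>\S+)\s+\.\.\.\s+bench:\s+(?P<ns>[\d,]+)\s+ns/iter"
)

ALERT_THRESHOLD = 0.15  # 15% slowdown, as stated in the PR description


def parse_bencher(text: str) -> dict[str, int]:
    """Return {benchmark name: ns per iteration} for every bench line."""
    results = {}
    for line in text.splitlines():
        match = BENCH_LINE.match(line.strip())
        if match:
            results[match["name"]] = int(match["ns"].replace(",", ""))
    return results


def compare(baseline: dict[str, int], current: dict[str, int]):
    """Return (name, relative change, is_regression) for shared benchmarks."""
    rows = []
    for name in sorted(baseline.keys() & current.keys()):
        if baseline[name] == 0:
            continue  # avoid dividing by zero on a degenerate baseline
        change = (current[name] - baseline[name]) / baseline[name]
        rows.append((name, change, change > ALERT_THRESHOLD))
    return rows
```

A regression flagged this way would only produce a warning in the PR comment, since the workflow is described as warning rather than failing CI.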

Local numbers (Apple M-series, sample_size=10)

bpe-gpt2/encode                 1.082 s   5.72 MiB/s
bpe-gpt2/encode-batch           762 ms    8.12 MiB/s
bpe-gpt2/encode-no-cache        1.292 s   4.79 MiB/s
bpe-gpt2/encode-batch-no-cache  244 ms    25.3 MiB/s
llama3/encode                   1.060 s   5.84 MiB/s
llama3/encode-batch             207 ms    29.9 MiB/s
llama3/encode-fast              1.067 s   5.80 MiB/s
llama3/encode-char-offsets      178 ms    34.8 MiB/s
llama3/concurrent-4t            21.6 ms   19.1 MiB/s
serialization/gpt2-load         24.1 ms
serialization/llama3-load       157 ms
serialization/llama3-save       35.1 ms   195.7 MiB/s
train/bpe-small                 26.0 ms   279 KiB/s
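
As a sanity check on the table, and assuming the throughput column is simply input bytes divided by elapsed time (the usual meaning of a bytes-based throughput setting in criterion), multiplying the two columns recovers the implied input size. The helper name below is hypothetical.

```python
def implied_input_mib(seconds: float, mib_per_second: float) -> float:
    """Input size implied by a (time, throughput) pair, in MiB."""
    return seconds * mib_per_second


# (elapsed seconds, MiB/s) copied from the local numbers above
rows = {
    "bpe-gpt2/encode": (1.082, 5.72),
    "llama3/encode": (1.060, 5.84),
    "llama3/encode-fast": (1.067, 5.80),
}

for name, (seconds, throughput) in rows.items():
    # Each of these rows works out to roughly 6.19 MiB.
    print(f"{name}: {implied_input_mib(seconds, throughput):.2f} MiB")
```

The three single-encode rows agree on roughly 6.19 MiB, which suggests they share one input; `llama3/concurrent-4t` implies a much smaller input, so it is likely measured on different data.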

Setup needed

The gh-pages branch needs to be initialized once (can be empty) for the baseline storage to work. After the first push to main with this workflow, subsequent PRs will get comparison comments automatically.

Add a consolidated benchmark suite (`ci_benchmark`) and a GitHub Actions workflow that automatically compares performance against the main branch baseline on every PR.

Benchmark coverage (13 measurements, ~4 min on CI):
- BPE GPT-2: encode, batch, no-cache, batch-no-cache
- Llama-3: encode, batch, encode-fast, char-offsets, concurrent-4t
- Serialization: GPT-2 load, Llama-3 load, Llama-3 save
- Training: BPE small corpus

CI workflow:
- On push to main: run benchmarks, store baseline in gh-pages branch
- On PR: run benchmarks, compare vs baseline, post/update a single PR comment with the delta table
- Alert threshold: 15% regression (warn, don't fail)

Uses benchmark-action/github-action-benchmark with criterion's bencher output format for machine-readable results.

HuggingFaceDocBuilderDev · 2026-04-08T13:57:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

- pytest-benchmark based test suite covering:
  - BPE GPT-2: encode, encode_batch, multithreaded (4 workers)
  - Llama-3: encode, encode_batch, encode_fast, multithreaded, decode_batch
  - Async: async_encode_batch, async_encode_batch_fast
  - Serialization: from_file, to_str, from_str (roberta, llama3, albert)
  - Training: BPE small corpus
- CI workflow: separate benchmark-python job with sccache + maturin
- Re-enabled ci-benchmarks branch trigger for testing

- New benchmark-trigger.yml: maintainer comments '/benchmark' on a PR to dispatch the benchmark workflow on the PR's ref
- Upload steps gated on github.event_name == 'push' so workflow_dispatch (PR runs) never overwrite the baseline
- Trigger requires MEMBER/OWNER/COLLABORATOR association
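
The gating rules in this commit can be summarized as two predicates. The real logic lives in workflow YAML conditions; the sketch below only restates it in Python, and the function names, the exact-match rule on the comment body, and the `is_pull_request` flag are assumptions.

```python
ALLOWED_ASSOCIATIONS = {"MEMBER", "OWNER", "COLLABORATOR"}


def should_dispatch(comment_body: str, author_association: str,
                    is_pull_request: bool) -> bool:
    """True when a '/benchmark' comment should start a benchmark run."""
    return (
        is_pull_request
        and comment_body.strip() == "/benchmark"  # assumed: exact match only
        and author_association in ALLOWED_ASSOCIATIONS
    )


def should_upload_baseline(event_name: str) -> bool:
    """Only pushes may overwrite the stored baseline; dispatched PR runs may not."""
    return event_name == "push"
```

Separating the two checks mirrors the intent of the commit: who may start a run is a permission question, while whether a run may write the baseline depends only on the event type.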

ArthurZucker · 2026-04-09T05:55:15Z

/benchmark

ArthurZucker · 2026-04-09T05:57:11Z

/benchmark

…en done
- benchmark-trigger.yml creates a 'Benchmark Results' check on the PR head SHA
- benchmarks.yml marks it in_progress at start, completed (success/failure) at end
- The check body contains the comparison markdown table
- Can be made a required check in branch protection rules
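
The check lifecycle described here maps onto three request bodies for the GitHub check runs REST API (create, then two updates). The sketch below only builds the JSON payloads and sends nothing; the helper names and the initial `queued` status are assumptions, not taken from the workflow files.

```python
CHECK_NAME = "Benchmark Results"


def create_check_payload(head_sha: str) -> dict:
    """Body for creating the check run on the PR head commit."""
    return {"name": CHECK_NAME, "head_sha": head_sha, "status": "queued"}


def in_progress_payload() -> dict:
    """Body for the update sent when the benchmark job starts."""
    return {"status": "in_progress"}


def completed_payload(success: bool, table_markdown: str) -> dict:
    """Body for the final update, carrying the comparison table."""
    return {
        "status": "completed",
        "conclusion": "success" if success else "failure",
        # The check-run output object requires both a title and a summary.
        "output": {"title": CHECK_NAME, "summary": table_markdown},
    }
```

Because the check is attached to the head SHA under a fixed name, branch protection can refer to it by that name when it is made a required check.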

ArthurZucker · 2026-04-09T06:11:02Z

/benchmark

ArthurZucker · 2026-04-09T06:11:27Z

/benchmark

Rust:
- Push to main: runs with --save-baseline main, uploads criterion data (tar.gz) + bencher output to HF Hub
- workflow_dispatch: downloads criterion baseline, runs with --baseline main for automatic criterion comparison
- criterion HTML report uploaded as GitHub Actions artifact (30 day retention)

Python:
- bench_output.json uploaded as GitHub Actions artifact
- Baseline stored/compared via HF Hub as before

Both:
- Artifacts downloadable from the workflow run page for manual inspection
- Comparison tables posted to PR comments
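
The difference between the two Rust modes comes down to which criterion flag is passed after `--`. A minimal sketch of that selection follows. The `cargo bench --bench ci_benchmark` invocation is an assumption about how the workflow calls the suite; `--save-baseline` and `--baseline` are criterion's own command-line options.

```python
def criterion_command(event_name: str, baseline: str = "main") -> list[str]:
    """Build the benchmark command line for a given GitHub event."""
    command = ["cargo", "bench", "--bench", "ci_benchmark", "--"]
    if event_name == "push":
        # Record this run as the new named baseline.
        return command + ["--save-baseline", baseline]
    if event_name == "workflow_dispatch":
        # Compare against the previously downloaded baseline without overwriting it.
        return command + ["--baseline", baseline]
    return command
```

With `--baseline`, criterion reports the change relative to the stored data, which is what makes the downloaded tar.gz of criterion data sufficient for a PR run.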

ArthurZucker added 4 commits April 8, 2026 16:16

fix: rustfmt + add github-token for PR comments

4582011

ci: require approval for benchmark runs on PRs

cac3a59

bench: add from_file + deserialize benchmarks to ci_benchmark

ab9170e

fix: split workflow into two jobs to avoid empty environment value

be8da27

ArthurZucker temporarily deployed to benchmarks April 8, 2026 14:38 — with GitHub Actions Inactive

fix: download benchmark data in CI + skip if output empty

0f0ec62

ArthurZucker temporarily deployed to benchmarks April 8, 2026 14:45 — with GitHub Actions Inactive

fix: use huggingface-cli for data download (handles gated models like Llama-3)

f1377ef

ArthurZucker temporarily deployed to benchmarks April 8, 2026 14:46 — with GitHub Actions Inactive

fix: use uvx for huggingface-cli, add setup-uv step

f05e21f

ArthurZucker temporarily deployed to benchmarks April 8, 2026 20:48 — with GitHub Actions Inactive

fix: hf not huggingface-cli

b7b1cdb

ArthurZucker temporarily deployed to benchmarks April 8, 2026 20:51 — with GitHub Actions Inactive

fix: single download from hf-internal-testing/tokenizers-bench-data

780a964

ArthurZucker temporarily deployed to benchmarks April 8, 2026 20:59 — with GitHub Actions Inactive

fix: use hf-internal-testing dataset for BOTH jobs, remove all curl/gated refs

07a02d7

ArthurZucker temporarily deployed to benchmarks April 8, 2026 21:02 — with GitHub Actions Inactive

ArthurZucker added 11 commits April 8, 2026 23:12

Initialize gh-pages for benchmark data

6a9cfdc

upupdate

34ee03f

Initialize gh-pages for benchmark data

bcaaa73

ci: sccache, workflow_dispatch with PR comment, push-to-main only (no PR trigger)

97b3181

ci: temporarily trigger on ci-benchmarks branch for testing

593911e

ci: touch tokenizers/ to trigger path filter

49aed98

ci: store baselines on HF Hub, drop github-action-benchmark + gh-pages

f4455cb

ci: also trigger on workflow file changes + doc touch

5df3de6

ci: skip upload if HF_TOKEN missing, remove ci-benchmarks branch trigger

149d81b

fix: skip benchmark tests when pytest-benchmark is not installed

54a7dec

ArthurZucker added 4 commits April 9, 2026 07:47

ci: Python bench comparison against saved baseline + PR comment

0e09dd4

ci: compare against baseline BEFORE uploading new one (both Rust and Python)

789f69d

ci: remove ci-benchmarks branch trigger

c8634fb

ArthurZucker requested a review from McPatate April 9, 2026 06:12

ArthurZucker added 5 commits April 9, 2026 18:11

update

55c6fea

ci: pin macOS Python to 3.13 (3.14 breaks abi3 linking)

94d7c59

ci: add -undefined dynamic_lookup for macOS abi3 cross-compilation linking

582043e

?

fee69ff

ArthurZucker merged commit 5e35576 into main Apr 10, 2026
36 checks passed

ArthurZucker deleted the ci-benchmarks branch April 10, 2026 10:17

ArthurZucker merged 32 commits into main from ci-benchmarks
