bindings & bench: use mimalloc as global allocator on tested targets by sebpop · Pull Request #2073 · huggingface/tokenizers

sebpop · 2026-05-26T23:17:10Z

Profiles of tokenizer.encode_batch(...) show malloc/free/realloc
calls inside libc costing a measurable fraction of CPU cycles. At high
thread counts these calls also issue cross-thread atomic operations on
every free. mimalloc has a thread-local free-list fast path (no atomics
unless a block is freed by a thread other than the one that allocated
it) and is faster per call than glibc malloc.
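
To make the fast-path idea concrete, here is a toy sketch of a per-thread free list in Rust. It is not mimalloc's code, and the names FREE_LIST, take_buffer, give_back, and cached_buffers are hypothetical. Because the list is thread-local, reusing a freed buffer on the owning thread needs no locks or atomic operations:

```rust
use std::cell::RefCell;

thread_local! {
    // Per-thread cache of released buffers. Only the owning thread ever
    // touches it, so no lock or atomic operation is needed.
    static FREE_LIST: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

/// Take a buffer with at least `capacity` bytes from this thread's cache,
/// or allocate a fresh one when the cache has nothing suitable.
fn take_buffer(capacity: usize) -> Vec<u8> {
    FREE_LIST.with(|list| {
        let mut list = list.borrow_mut();
        match list.iter().position(|b| b.capacity() >= capacity) {
            Some(i) => list.swap_remove(i),
            None => Vec::with_capacity(capacity),
        }
    })
}

/// Return a buffer to this thread's cache instead of freeing it.
fn give_back(mut buf: Vec<u8>) {
    buf.clear(); // drops the contents but keeps the heap block
    FREE_LIST.with(|list| list.borrow_mut().push(buf));
}

/// Number of buffers currently cached by the calling thread.
fn cached_buffers() -> usize {
    FREE_LIST.with(|list| list.borrow().len())
}
```

Calling give_back(take_buffer(64)) followed by take_buffer(64) on the same thread hands back the same heap block. The flip side is that a cached block stays allocated until it is reused, which is the retention effect discussed under "Memory footprint".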

#[global_allocator] is a process-wide setting; declaring it in a
library crate would make the choice global for every downstream Rust
binary that depends on tokenizers, conflict with any other
#[global_allocator] chosen by a consumer, and is broadly considered
bad practice. This patch therefore opts each binary into mimalloc
independently:

- bindings/python/ (cdylib loaded by the Python interpreter) gains
  an optional mimalloc feature, enabled by default, that registers
  mimalloc::MiMalloc as the cdylib's global allocator.
- bindings/node/ (cdylib loaded by Node) gains the same feature
  and default.
- tokenizers/benches/bpe_benchmark.rs registers mimalloc as the
  bench binary's global allocator via mimalloc in
  [dev-dependencies]. This way cargo bench numbers reflect the
  same allocator the production wheels ship with, without imposing
  any allocator choice on library consumers (dev-dependencies do not
  propagate).

The tokenizers/ library crate (src/lib.rs, runtime feature set)
is intentionally unchanged.
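
As an illustration of why the registration belongs in the final binary, here is a minimal, dependency-free sketch of #[global_allocator]. It assumes a stand-in allocator (the hypothetical CountingAlloc, which forwards to std's System) so that it compiles without external crates; in this PR the registered type is mimalloc::MiMalloc instead.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in allocator (hypothetical): forwards every request to the
/// system allocator and counts how many allocations were made.
struct CountingAlloc;

static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Process-wide: every Box, Vec, and String in this binary and in all of
// its dependencies now goes through CountingAlloc. Only one such
// declaration may exist in a binary's whole dependency graph.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocations served so far.
fn alloc_calls() -> usize {
    ALLOC_CALLS.load(Ordering::Relaxed)
}
```

Because the declaration affects the whole process, a library crate that contained it would force the choice on every consumer, which is the reason the tokenizers/ crate is left alone.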

Restricted to tested targets

The first iteration of this patch broke CI on every target other than
musllinux and recent macOS, for two distinct reasons:

- libmimalloc-sys 0.1.49 passes -Werror=date-time unconditionally,
  which fails on the older cross-compile gcc shipped by the
  ubuntu-latest manylinux2014 builders for s390x, ppc64le, armv7,
  riscv64, and 32-bit x86.
- MSVC linker mismatch on Windows: mimalloc's C objects use the static
  CRT (MT_StaticRelease) while esaxx_rs uses the dynamic CRT
  (MD_DynamicRelease), producing LNK2038.

Rather than work around each of those, the mimalloc dependency and
the #[global_allocator] declaration are now restricted by a
cfg(...) predicate to platforms where the patch author has measured
the change:

- aarch64-unknown-linux-gnu
- x86_64-unknown-linux-gnu
- aarch64-apple-darwin

On any other target the mimalloc crate is not pulled in (Cargo
honours [target.'cfg(...)'.dependencies]) and the #[global_allocator]
declaration is compiled out because its cfg(...) predicate evaluates
to false, so the cdylib falls back transparently to the system
allocator and the build proceeds without error. The mimalloc feature
stays in the default features so users on tested targets get the win
automatically; users on tested targets who want the system allocator
can build with --no-default-features.
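
The gating described above can be sketched in Rust as follows. The predicate is an assumption reconstructed from the three triples listed in this description, not a copy of the one in the patch, and selected_allocator is a hypothetical helper used only to make the outcome observable:

```rust
// Assumed predicate covering aarch64-unknown-linux-gnu,
// x86_64-unknown-linux-gnu, and aarch64-apple-darwin.
#[cfg(any(
    all(target_arch = "aarch64", target_os = "linux", target_env = "gnu"),
    all(target_arch = "x86_64", target_os = "linux", target_env = "gnu"),
    all(target_arch = "aarch64", target_os = "macos"),
))]
fn selected_allocator() -> &'static str {
    // On these targets the real patch would register mimalloc::MiMalloc.
    "mimalloc"
}

// The negated predicate: on every other target the gated item above does
// not exist, and this fallback is compiled instead.
#[cfg(not(any(
    all(target_arch = "aarch64", target_os = "linux", target_env = "gnu"),
    all(target_arch = "x86_64", target_os = "linux", target_env = "gnu"),
    all(target_arch = "aarch64", target_os = "macos"),
)))]
fn selected_allocator() -> &'static str {
    "system"
}
```

The same expression would also key the [target.'cfg(...)'.dependencies] table in Cargo.toml, which is why the description stresses keeping the copies identical.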

Community testing of additional triples is welcome. Adding a triple
requires extending the same cfg(...) predicate in three places: the
Cargo.toml and lib.rs of each of the two bindings, and the bench
Cargo.toml together with bpe_benchmark.rs. The expressions are kept
identical and reference each other in comments.

cargo test --lib --features http (tokenizers/, unchanged by this PR):
201 passed, 0 failed.

Throughput on tested platforms

cargo bench --bench bpe_benchmark, bpe-encode/BPE GPT2 encode batch
(data/big.txt, 6.5 MB through the full post-processor):

  platform                                              threads  before        after         change
  --------                                              -------  ------        ------        ------
  Vera, NVIDIA, Linux aarch64, 88 physical cores         1T      3.96 MiB/s    6.18 MiB/s    +56 %
                                                         88T     17.75 MiB/s   19.40 MiB/s   +9.3 %
  AMD EPYC 9124, Linux x86_64, 16 physical cores         1T      3.84 MiB/s    5.19 MiB/s    +35 %
                                                         16T     23.40 MiB/s   42.72 MiB/s   +83 %
  Apple M3 Pro, macOS aarch64, 12 unified cores          1T      4.64 MiB/s    6.90 MiB/s    +48 %
                                                         12T     17.53 MiB/s   34.60 MiB/s   +97 %

The 1T wins (+35-56 %) come from the same mechanism on every platform:
per-call malloc/free is on the critical path even when there is no
contention, and mimalloc's single-thread fast path costs roughly half
as much as glibc / libsystem malloc.

The wins at full physical core count differ by platform: large on EPYC
(+83 %) and M3 Pro (+97 %), modest on Vera (+9.3 %). On Vera the
bpe_benchmark hot path at 88T is bottlenecked by other
synchronisation primitives that this patch does not address; combined
with separate changes that target those primitives, mimalloc's
share-of-cycles win becomes a throughput win as well. Measured here
on top of main alone, +9.3 % is what this patch delivers by itself.
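
For readers who want to re-derive the "change" column, here is a small sketch of the arithmetic. The before/after figures are copied from the table above; percent_change, RESULTS, and report are hypothetical names:

```rust
/// Relative change from `before` to `after`, in percent.
fn percent_change(before: f64, after: f64) -> f64 {
    (after / before - 1.0) * 100.0
}

/// (platform, threads, before MiB/s, after MiB/s), from the table above.
const RESULTS: [(&str, u32, f64, f64); 6] = [
    ("Vera", 1, 3.96, 6.18),
    ("Vera", 88, 17.75, 19.40),
    ("EPYC 9124", 1, 3.84, 5.19),
    ("EPYC 9124", 16, 23.40, 42.72),
    ("M3 Pro", 1, 4.64, 6.90),
    ("M3 Pro", 12, 17.53, 34.60),
];

/// One formatted line per table row, with the change to one decimal place.
fn report() -> Vec<String> {
    RESULTS
        .iter()
        .map(|(platform, threads, before, after)| {
            format!(
                "{platform} {threads}T: {:+.1} %",
                percent_change(*before, *after)
            )
        })
        .collect()
}
```

Printing each entry of report() lists the six rows; small differences from the table are expected because the throughput values themselves are rounded.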

Perf evidence on Vera (88 physical cores) at 88T, collected with
perf record -g --call-graph fp -F 4999 on a Python process calling
tokenizer.encode_batch(docs, false) in a loop for 15 s (docs is
data/big.txt split into 999 ~6.5 KB chunks). Wheels were built with
-Ctarget-feature=+lse,+rcpc:

  symbol                                           before     after
  ------                                           ------     ------
  libc malloc / cfree / realloc (combined)         ~1.8 %     ~0.13 %  (-13x)
  mimalloc family (mi_page_malloc, mi_free, ...)   -          ~0.96 %
  std::thread::local::LocalKey<T>::with            -          14.25 %

The libc malloc family drops by an order of magnitude. mimalloc's
fast path costs about half as much (~0.96 %) and brings in ~14 % of
cycles in LocalKey::with, the price of its thread-local heap.

Memory footprint

The throughput win above comes from mimalloc caching freed memory in
thread-local pools rather than returning it immediately to the OS;
the same mechanism increases the steady-state resident size of the
Python process. Measured on Vera, same Python script that ran the
throughput sweep above, RSS read from /proc/self/status after 50
calls to tokenizer.encode_batch(docs, false):

  metric                          before        after         delta
  ------                          ------        ------        ------
  cdylib `.so` on disk            10.08 MiB     10.24 MiB     +0.15 MiB
  process VmRSS (1T)              237 MiB       704 MiB       x2.97
  process VmRSS (88T)             720 MiB       2 035 MiB     x2.83
  process VmPeak virtual (88T)    6.1 GiB       9.2 GiB       x1.50

The on-disk cost is ~155 KiB of embedded libmimalloc. At runtime,
mimalloc holds onto free pages in per-thread heaps for reuse instead
of returning them to the kernel after every encode_batch call;
that cache is what makes the next call cheap, and what makes the
steady-state RSS ~3x larger. mimalloc does eventually purge unused
pages back to the OS, but on its own schedule, not synchronously.

This is the explicit trade-off: ~3x physical RSS for +56 % (1T) /
+9 % (88T) on this benchmark. Users for whom that trade is wrong
can build with --no-default-features to fall back to the system
allocator.
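
The RSS figures above were read from /proc/self/status by the Python script. As a rough Rust equivalent, and only a sketch, the following parses the VmRSS line; parse_vm_rss_kib and current_rss_kib are hypothetical helpers, and the /proc file exists only on Linux:

```rust
use std::fs;

/// Extract the VmRSS value from the text of a /proc/<pid>/status file.
/// Returns None when no parsable VmRSS line is present.
fn parse_vm_rss_kib(status: &str) -> Option<u64> {
    status
        .lines()
        // The line looks like "VmRSS:\t  720896 kB".
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|number| number.parse().ok())
}

/// Resident set size of the current process as reported by the kernel,
/// or None where /proc/self/status is unavailable (for example on macOS).
fn current_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kib(&status)
}
```

Sampling current_rss_kib() before and after a batch of encode calls, once with each allocator, is one way to reproduce a comparison like the table above.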

HuggingFaceDocBuilderDev · 2026-05-28T08:13:11Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

McPatate · 2026-05-28T08:18:10Z

Cool results! Could you measure the results on different permutations of platforms and architectures? So x86, Apple silicon, Windows, etc.?

Curious to see how it impacts memory footprint as well.

I ran the CI, I'll let you take a look! Thank you for investigating!


sebpop · 2026-05-29T07:53:38Z

> Curious to see how it impacts memory footprint as well.

The on-disk cost is ~155 KiB of embedded libmimalloc. At runtime, mimalloc holds onto free pages in per-thread heaps for reuse instead of returning them to the kernel after every encode_batch call; that cache is what makes the next call cheap, and what makes the steady-state RSS ~3x larger. mimalloc does eventually purge unused pages back to the OS, but on its own schedule, not synchronously.

This is the explicit trade-off: ~3x physical RSS for +56 % (1T) / +9 % (88T) on this benchmark. Users for whom that trade is wrong can build with --no-default-features to fall back to the system allocator.

sebpop · 2026-05-29T09:46:52Z

Tracing the aarch64-unknown-linux-gnu failure to:

cc1: error: -Werror=date-time: no option -Wdate-time

emitted by libmimalloc-sys@0.1.49's build script. This isn't an aarch64-vs-x86_64 issue; it's the manylinux2014 cross-toolchain that the maturin action pulls in (ghcr.io/rust-cross/manylinux2014-cross:aarch64 and the x86_64 equivalent). Both Linux GNU cross-compile jobs fail for the same reason; musllinux, native macOS, and Windows all pass.

manylinux2014 ships a gcc that doesn't recognise -Wdate-time (the flag landed in gcc 4.9 but is missing here for unclear reasons). libmimalloc-sys 0.1.49 is the latest on crates.io, so there's no upstream fix to bump to. The dep itself is mature; the incompatibility is purely with this specific cross-toolchain.

Two ways forward:

1. Bump the Linux build's manylinux baseline from auto (=2014) to manylinux_2_28 in python-release.yml (and the equivalent matrix entry in python.yml). manylinux_2_28 ships gcc 12, well past the -Wdate-time issue. The trade-off is dropping support for glibc < 2.28 in published wheels; that's the RHEL/CentOS 7 era, which has been EOL'd. Most other PyO3-based projects (e.g. pydantic-core, cryptography, tiktoken) have already moved to 2_28 for this exact class of reasons.
2. Add a before-script-linux: yum install -y devtoolset-12-gcc; source /opt/rh/devtoolset-12/enable to the maturin-action invocation so manylinux2014 gets a newer compiler at build time. Keeps the glibc baseline but adds CI surface area.

I'd lean (1) since you've already moved the rest of the matrix off manylinux2010. Happy to send a follow-up PR with that workflow change once you say which path you prefer.

sebpop force-pushed the p4 branch from ba195a7 to aba42d0 Compare May 29, 2026 07:35

sebpop changed the title ~~bindings & bench: use mimalloc as global allocator in cdylib and bench binaries~~ bindings & bench: use mimalloc as global allocator on tested targets May 29, 2026

sebpop force-pushed the p4 branch from aba42d0 to 263ffc7 Compare May 29, 2026 07:54

sebpop wants to merge 1 commit into huggingface:main from sebpop:p4
