
bindings & bench: use mimalloc as global allocator on tested targets#2073

Open
sebpop wants to merge 1 commit into huggingface:main from sebpop:p4

@sebpop
Contributor

@sebpop sebpop commented May 26, 2026

Profiles of tokenizer.encode_batch(...) show malloc/free/realloc
calls inside libc costing a measurable fraction of CPU cycles. At high
thread counts these calls also issue cross-thread atomic operations on
every free. mimalloc has a thread-local free-list fast path (no atomics
until a cross-thread free is delayed) and is faster per call than glibc
malloc.
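
To make the fast-path idea concrete, here is a toy Rust sketch of a per-thread free list. It is not mimalloc's implementation: the names `take_block`, `give_back` and `FREE_LIST`, and the fixed 64-byte block size, are invented for illustration. It only shows why recycling a block on the thread that owns it needs neither atomics nor locks.

```rust
use std::cell::RefCell;

/// Hypothetical fixed block size for this toy example.
const BLOCK: usize = 64;

thread_local! {
    // One free list per thread. Only the owning thread ever touches it,
    // so a plain RefCell is enough: no atomics, no locks.
    static FREE_LIST: RefCell<Vec<Box<[u8; BLOCK]>>> = RefCell::new(Vec::new());
}

/// Hand out a block. The bool is true when the block came from this
/// thread's free list rather than from a fresh allocation.
pub fn take_block() -> (Box<[u8; BLOCK]>, bool) {
    match FREE_LIST.with(|fl| fl.borrow_mut().pop()) {
        Some(block) => (block, true),
        None => (Box::new([0u8; BLOCK]), false),
    }
}

/// Return a block to the current thread's free list for later reuse.
pub fn give_back(block: Box<[u8; BLOCK]>) {
    FREE_LIST.with(|fl| fl.borrow_mut().push(block));
}
```

On a fresh thread the first `take_block()` allocates; after `give_back`, the next `take_block()` on that thread is served from the list. A real allocator additionally has to handle blocks freed by a different thread, which is where atomic operations come back in.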

#[global_allocator] is a process-wide setting; declaring it in a
library crate would make the choice global for every downstream Rust
binary that depends on tokenizers, conflict with any other
#[global_allocator] chosen by a consumer, and is broadly considered
bad practice. This patch therefore opts each binary into mimalloc
independently:

  • bindings/python/ (cdylib loaded by the Python interpreter) gains
    an optional mimalloc feature, enabled by default, that registers
    mimalloc::MiMalloc as the cdylib's global allocator.
  • bindings/node/ (cdylib loaded by Node) gains the same feature
    and default.
  • tokenizers/benches/bpe_benchmark.rs registers mimalloc as the
    bench binary's global allocator via mimalloc in
    [dev-dependencies]. This way cargo bench numbers reflect the
    same allocator the production wheels ship with, without imposing
    any allocator choice on library consumers (dev-dependencies do not
    propagate).

The tokenizers/ library crate (src/lib.rs, runtime feature set)
is intentionally unchanged.
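
As a rough illustration of what "registering a global allocator in a binary" means, the sketch below installs a stand-in allocator with `#[global_allocator]`. The `CountingAlloc` type, the `ALLOC_CALLS` counter and the `allocations_during` helper are hypothetical and exist only so the example builds with the standard library alone; in the bindings, `mimalloc::MiMalloc` would occupy the slot that `CountingAlloc` fills here.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stand-in allocator (hypothetical): forwards to the system allocator
/// and counts how many allocation requests it has seen.
pub struct CountingAlloc;

pub static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Exactly one such static may exist in the final linked artifact, which
// is why it belongs in a binary or cdylib rather than in a library crate.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Number of allocation requests observed while `f` runs.
pub fn allocations_during<F: FnOnce()>(f: F) -> usize {
    let before = ALLOC_CALLS.load(Ordering::Relaxed);
    f();
    ALLOC_CALLS.load(Ordering::Relaxed) - before
}
```

Every `Vec`, `String` or `Box` created anywhere in the process now goes through `GLOBAL`, including allocations made by dependencies, which is the process-wide effect described above.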

Restricted to tested targets

The first iteration of this patch broke CI on every target other than
musllinux and recent macOS, for two distinct reasons:

  • libmimalloc-sys 0.1.49 passes -Werror=date-time unconditionally,
    which fails on the older cross-compile gcc shipped by the
    ubuntu-latest manylinux2014 builders for s390x, ppc64le, armv7,
    riscv64, and 32-bit x86.
  • MSVC linker mismatch on Windows: mimalloc's C objects use the static
    CRT (MT_StaticRelease) while esaxx_rs uses the dynamic CRT
    (MD_DynamicRelease), producing LNK2038.

Rather than work around each of those, the mimalloc dependency and
the #[global_allocator] declaration are now restricted by a
cfg(...) predicate to platforms where the patch author has measured
the change:

  • aarch64-unknown-linux-gnu
  • x86_64-unknown-linux-gnu
  • aarch64-apple-darwin

On any other target the mimalloc crate is not pulled in (Cargo
honours [target.'cfg(...)'.dependencies] tables) and the
#[global_allocator] declaration is compiled out by the same cfg(...)
predicate, so the cdylib falls back transparently to the system
allocator and the build proceeds without error. The mimalloc feature
stays in default features so users on tested targets get the win
automatically; users on tested targets who want the system allocator
can build with --no-default-features.
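
A minimal sketch of how such a target predicate could look, assuming the three triples are expressed through `target_arch`, `target_os` and `target_env` (the PR's literal predicate is not shown here, so treat the exact clauses as an assumption). `ON_TESTED_TARGET` and `allocator_name` are hypothetical names used only to make the gating observable without depending on the mimalloc crate.

```rust
// Cargo.toml side, shown as a comment (sketch only; version left open):
//
//   [target.'cfg(any(<same clauses as below>))'.dependencies]
//   mimalloc = { version = "<version>", optional = true }

/// True only when compiling for one of the three measured targets.
pub const ON_TESTED_TARGET: bool = cfg!(any(
    all(target_arch = "aarch64", target_os = "linux", target_env = "gnu"),
    all(target_arch = "x86_64", target_os = "linux", target_env = "gnu"),
    all(target_arch = "aarch64", target_os = "macos"),
));

/// Which allocator a binary gated by this predicate would end up with.
pub fn allocator_name() -> &'static str {
    if ON_TESTED_TARGET {
        "mimalloc"
    } else {
        "system"
    }
}

// In a binding's lib.rs the same predicate would guard the declaration,
// combined with the opt-out feature:
//
//   #[cfg(all(feature = "mimalloc", any(/* same three clauses */)))]
//   #[global_allocator]
//   static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
```

Because the predicate is evaluated at compile time, a target outside the list never sees the `mimalloc` dependency or the `#[global_allocator]` item at all.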

Community testing of additional triples is welcome. Adding a triple
requires extending the same cfg(...) predicate in three places (the
Cargo.toml and lib.rs of each of the two bindings, and the bench's
Cargo.toml and bpe_benchmark.rs); the expressions are kept identical
and reference each other in comments.

cargo test --lib --features http (tokenizers/, unchanged by this
PR): 201 passed, 0 failed.

Throughput on tested platforms

cargo bench --bench bpe_benchmark, bpe-encode/BPE GPT2 encode batch
(data/big.txt, 6.5 MB through the full post-processor):

  platform                                              threads  before        after         change
  --------                                              -------  ------        ------        ------
  Vera, NVIDIA, Linux aarch64, 88 physical cores         1T      3.96 MiB/s    6.18 MiB/s    +56 %
                                                         88T     17.75 MiB/s   19.40 MiB/s   +9.3 %
  AMD EPYC 9124, Linux x86_64, 16 physical cores         1T      3.84 MiB/s    5.19 MiB/s    +35 %
                                                         16T     23.40 MiB/s   42.72 MiB/s   +83 %
  Apple M3 Pro, macOS aarch64, 12 unified cores          1T      4.64 MiB/s    6.90 MiB/s    +48 %
                                                         12T     17.53 MiB/s   34.60 MiB/s   +97 %

The 1T wins (+35-56 %) are the same mechanism on every platform: per-
call malloc/free is on the critical path when there is no contention,
and mimalloc's single-thread fast path is roughly twice as cheap as
glibc / libsystem malloc.
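
For readers who want to re-derive the "change" column, the arithmetic is a plain relative difference of the two throughput figures. The helper name `percent_change` is hypothetical; the inputs are the rounded MiB/s values from the table, so results can differ from the table in the last digit.

```rust
/// Relative change from `before` to `after`, in percent.
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after / before - 1.0) * 100.0
}

/// Print the change for two rows of the table above.
pub fn print_examples() {
    // Vera, 1T: 3.96 -> 6.18 MiB/s, about +56 %.
    println!("Vera 1T:   {:+.1} %", percent_change(3.96, 6.18));
    // EPYC, 16T: 23.40 -> 42.72 MiB/s, about +83 %.
    println!("EPYC 16T:  {:+.1} %", percent_change(23.40, 42.72));
}
```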

The at-full-physical-cores wins differ by platform: large on EPYC
(+83 %) and M3 Pro (+97 %), modest on Vera (+9.3 %). On Vera the
bpe_benchmark hot path at 88T is bottlenecked by other
synchronisation primitives that this patch does not address; combined
with separate changes that target those primitives, mimalloc's
share-of-cycles win becomes a throughput win as well. Measured here
on top of main alone, the +9.3 % is honest.

Perf evidence on Vera (88 physical cores) at 88T,
perf record -g --call-graph fp -F 4999 on a Python process calling
tokenizer.encode_batch(docs, false) in a loop for 15 s (docs is
data/big.txt split into 999 ~6.5 KB chunks). Wheels built
-Ctarget-feature=+lse,+rcpc:

  symbol                                           before     after
  ------                                           ------     ------
  libc malloc / cfree / realloc (combined)         ~1.8 %     ~0.13 %  (-13x)
  mimalloc family (mi_page_malloc, mi_free, ...)   -          ~0.96 %
  std::thread::local::LocalKey<T>::with            -          14.25 %

The libc malloc family drops by an order of magnitude. mimalloc's
fast path costs about half as much as the libc family did before
(~0.96 % vs ~1.8 %) and brings in ~14 % of cycles in LocalKey::with,
the price of its thread-local heap.

Memory footprint

The throughput win above comes from mimalloc caching freed memory in
thread-local pools rather than returning it immediately to the OS;
the same mechanism increases the steady-state resident size of the
Python process. Measured on Vera, same Python script that ran the
throughput sweep above, RSS read from /proc/self/status after 50
calls to tokenizer.encode_batch(docs, false):

  metric                          before        after         delta
  ------                          ------        ------        ------
  cdylib `.so` on disk            10.08 MiB     10.24 MiB     +0.15 MiB
  process VmRSS (1T)              237 MiB       704 MiB       x2.97
  process VmRSS (88T)             720 MiB       2 035 MiB     x2.83
  process VmPeak virtual (88T)    6.1 GiB       9.2 GiB       x1.50

The on-disk cost is ~155 KiB of embedded libmimalloc. At runtime,
mimalloc holds onto free pages in per-thread heaps for reuse instead
of returning them to the kernel after every encode_batch call;
that cache is what makes the next call cheap, and what makes the
steady-state RSS ~3x larger. mimalloc does eventually purge unused
pages back to the OS, but on its own schedule, not synchronously.

This is the explicit trade-off: ~3x physical RSS for +56 % (1T) /
+9 % (88T) on this benchmark. Users for whom that trade is wrong
can build with --no-default-features to fall back to the system
allocator.
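
For anyone repeating the RSS measurement, the sketch below shows one way to read `VmRSS` from `/proc/self/status`. The numbers above were collected from a Python script; this Rust version and the names `parse_status_kib` and `current_rss_kib` are illustrative assumptions, not the script that was used.

```rust
use std::fs;

/// Extract a numeric field such as `VmRSS` or `VmPeak` from the text of
/// /proc/<pid>/status. The kernel reports these fields in kB (1024 bytes).
pub fn parse_status_kib(status: &str, key: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        // Lines look like "VmRSS:\t  242688 kB".
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Resident set size of the current process in KiB.
/// Returns None where /proc/self/status is unavailable (e.g. non-Linux).
pub fn current_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_status_kib(&status, "VmRSS")
}
```

Calling `current_rss_kib()` after the workload and dividing by 1024 gives a figure in MiB comparable to the VmRSS rows above; passing `"VmPeak"` to `parse_status_kib` gives the peak virtual size instead.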

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@McPatate
Member

Cool results! Could you measure the results on different combinations of platforms and architectures? So x86, Apple silicon, Windows, etc.?

Curious to see how it impacts memory footprint as well.

I ran the CI, I'll let you take a look! Thank you for investigating!

@sebpop changed the title from "bindings & bench: use mimalloc as global allocator in cdylib and bench binaries" to "bindings & bench: use mimalloc as global allocator on tested targets" on May 29, 2026
@sebpop
Contributor Author

sebpop commented May 29, 2026

> Curious to see how it impacts memory footprint as well.

The on-disk cost is ~155 KiB of embedded libmimalloc. At runtime, mimalloc holds onto free pages in per-thread heaps for reuse instead of returning them to the kernel after every encode_batch call; that cache is what makes the next call cheap, and what makes the steady-state RSS ~3x larger. mimalloc does eventually purge unused pages back to the OS, but on its own schedule, not synchronously.

This is the explicit trade-off: ~3x physical RSS for +56 % (1T) / +9 % (88T) on this benchmark. Users for whom that trade is wrong can build with --no-default-features to fall back to the system allocator.

@sebpop
Contributor Author

sebpop commented May 29, 2026

Tracing the aarch64-unknown-linux-gnu failure to:

cc1: error: -Werror=date-time: no option -Wdate-time

emitted by libmimalloc-sys@0.1.49's build script. This isn't an aarch64-vs-x86_64 issue; it's the manylinux2014 cross-toolchain that the maturin action pulls in (ghcr.io/rust-cross/manylinux2014-cross:aarch64 and the x86_64 equivalent). Both Linux GNU cross-compile jobs fail for the same reason; musllinux, native macOS, and Windows all pass.

manylinux2014 ships a gcc that doesn't recognise -Wdate-time (the flag landed in gcc 4.9 but is missing here for unclear reasons). libmimalloc-sys 0.1.49 is the latest on crates.io, so there's no upstream fix to bump to. The dependency itself is mature; the incompatibility is purely with this specific cross-toolchain.

Two ways forward:

  1. Bump the Linux build's manylinux baseline from auto (=2014) to manylinux_2_28 in python-release.yml (and the equivalent matrix entry in python.yml). manylinux_2_28 ships gcc 12, well past the -Wdate-time issue. The trade-off is dropping support for glibc < 2.28 in published wheels; that is the RHEL/CentOS 7 era, which has reached end of life. Most other PyO3-based projects (e.g. pydantic-core, cryptography, tiktoken) have already moved to 2_28 for this exact class of reasons.

  2. Add a before-script-linux: yum install -y devtoolset-12-gcc; source /opt/rh/devtoolset-12/enable to the maturin-action invocation so manylinux2014 gets a newer compiler at build time. Keeps the glibc baseline but adds CI surface area.

I'd lean (1) since you've already moved the rest of the matrix off manylinux2010. Happy to send a follow-up PR with that workflow change once you say which path you prefer.
