bindings & bench: use mimalloc as global allocator on tested targets #2073
sebpop wants to merge 1 commit into
Cool results! Could you measure the results on different permutations of platforms and architectures? So x86, Apple silicon, Windows, etc.? Curious to see how it impacts memory footprint as well. I ran the CI, I'll let you take a look! Thank you for investigating!
Profiles of `tokenizer.encode_batch(...)` show `malloc`/`free`/`realloc`
calls inside libc costing a measurable fraction of CPU cycles. At high
thread counts they emit cross-thread atomic operations on every free.
mimalloc has a thread-local free-list fast path (no atomics unless a
block is freed by a thread other than the one that allocated it) and is
faster per call than glibc malloc.
`#[global_allocator]` is a process-wide setting; declaring it in a
library crate would make the choice global for every downstream Rust
binary that depends on `tokenizers`, conflict with any other
`#[global_allocator]` chosen by a consumer, and is broadly considered
bad practice. This patch therefore opts each *binary* into mimalloc
independently:
- `bindings/python/` (cdylib loaded by the Python interpreter) gains
an optional `mimalloc` feature, enabled by default, that registers
`mimalloc::MiMalloc` as the cdylib's global allocator.
- `bindings/node/` (cdylib loaded by Node) gains the same feature
and default.
- `tokenizers/benches/bpe_benchmark.rs` registers mimalloc as the
bench binary's global allocator via `mimalloc` in
`[dev-dependencies]`. This way `cargo bench` numbers reflect the
same allocator the production wheels ship with, without imposing
any allocator choice on library consumers (dev-dependencies do not
propagate).
The `tokenizers/` library crate (`src/lib.rs`, runtime feature set)
is intentionally unchanged.
# Restricted to tested targets
The first iteration of this patch broke CI on every target other than
musllinux and recent macOS, for two distinct reasons:
- libmimalloc-sys 0.1.49 passes `-Werror=date-time` unconditionally,
which fails on the older cross-compile gcc shipped by the
ubuntu-latest manylinux2014 builders for s390x, ppc64le, armv7,
riscv64, and 32-bit x86.
- MSVC linker mismatch on Windows: mimalloc's C objects use the static
CRT (`MT_StaticRelease`) while esaxx_rs uses the dynamic CRT
(`MD_DynamicRelease`), producing `LNK2038`.
Rather than work around each of those, the mimalloc dependency and
the `#[global_allocator]` declaration are now restricted by a
`cfg(...)` predicate to platforms where the patch author has measured
the change:
- `aarch64-unknown-linux-gnu`
- `x86_64-unknown-linux-gnu`
- `aarch64-apple-darwin`
On any other target the mimalloc crate is not pulled in (Cargo
honours `[target.'cfg(...)'.dependencies]`) and the `#[global_allocator]`
declaration is compiled out because its `cfg(...)` predicate evaluates
to false, so the cdylib falls back transparently to the system
allocator and the build proceeds without error. The `mimalloc` feature
stays in `default` features so users on tested targets get the win
automatically; users on tested targets who want the system allocator
can build with `--no-default-features`.
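The target-gated registration described above can be sketched as follows. This is only an illustration, not the PR's code: the exact `cfg(...)` expression is not shown in this description, so the predicate below is an assumption built from the three listed triples, and `StandInAlloc` is a hypothetical std-only stand-in for `mimalloc::MiMalloc` so that the sketch compiles without the external crate.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for `mimalloc::MiMalloc`: it forwards to the system
/// allocator and counts calls, so this sketch compiles with std alone.
pub struct StandInAlloc;

/// Number of allocations served by `StandInAlloc`.
pub static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for StandInAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Assumed predicate for the three tested triples. The PR additionally gates
// on its `mimalloc` Cargo feature, which is left out of this sketch.
#[cfg(any(
    all(
        target_os = "linux",
        target_env = "gnu",
        any(target_arch = "aarch64", target_arch = "x86_64")
    ),
    all(target_os = "macos", target_arch = "aarch64"),
))]
#[global_allocator]
static GLOBAL: StandInAlloc = StandInAlloc;

/// Mirrors the predicate above: true when the custom allocator is registered,
/// false when the build silently falls back to the system allocator.
pub fn custom_allocator_registered() -> bool {
    cfg!(any(
        all(
            target_os = "linux",
            target_env = "gnu",
            any(target_arch = "aarch64", target_arch = "x86_64")
        ),
        all(target_os = "macos", target_arch = "aarch64"),
    ))
}
```

On a target outside the predicate the `static GLOBAL` item simply does not exist, so nothing has to be linked and every allocation goes to the default system allocator. The predicate appears twice in this sketch, which is the same duplication the PR keeps in sync across its manifests and source files.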
Community testing of additional triples is welcome. Adding a triple
requires extending the same `cfg(...)` predicate in three places: the
Python binding (`Cargo.toml` and `lib.rs`), the Node binding
(`Cargo.toml` and `lib.rs`), and the bench (`Cargo.toml` and
`bpe_benchmark.rs`). The expressions are kept identical and reference
each other in comments.
`cargo test --lib --features http` (`tokenizers/`, unchanged by this
PR): 201 passed, 0 failed.
# Throughput on tested platforms
`cargo bench --bench bpe_benchmark`, `bpe-encode/BPE GPT2 encode batch`
(`data/big.txt`, 6.5 MB through the full post-processor):
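
| platform | threads | before | after | change |
| -------- | ------- | ------ | ----- | ------ |
| Vera, NVIDIA, Linux aarch64, 88 physical cores | 1T | 3.96 MiB/s | 6.18 MiB/s | +56 % |
| Vera, NVIDIA, Linux aarch64, 88 physical cores | 88T | 17.75 MiB/s | 19.40 MiB/s | +9.3 % |
| AMD EPYC 9124, Linux x86_64, 16 physical cores | 1T | 3.84 MiB/s | 5.19 MiB/s | +35 % |
| AMD EPYC 9124, Linux x86_64, 16 physical cores | 16T | 23.40 MiB/s | 42.72 MiB/s | +83 % |
| Apple M3 Pro, macOS aarch64, 12 unified cores | 1T | 4.64 MiB/s | 6.90 MiB/s | +48 % |
| Apple M3 Pro, macOS aarch64, 12 unified cores | 12T | 17.53 MiB/s | 34.60 MiB/s | +97 % |
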
The 1T wins (+35-56 %) are the same mechanism on every platform: per-
call malloc/free is on the critical path when there is no contention,
and mimalloc's single-thread fast path is roughly twice as cheap as
glibc / libsystem malloc.
The at-full-physical-cores wins differ by platform: large on EPYC
(+83 %) and M3 Pro (+97 %), modest on Vera (+9.3 %). On Vera the
`bpe_benchmark` hot path at 88T is bottlenecked by other
synchronisation primitives that this patch does not address; combined
with separate changes that target those primitives, mimalloc's
share-of-cycles win becomes a throughput win as well. Measured here
on top of `main` alone, the +9.3 % is honest.
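For readers who want to recheck the change column of the throughput table, the following sketch recomputes it as the relative gain `(after - before) / before`. The function and constant names are illustrative only; the numbers are copied from the table, and because they are already rounded the recomputed values can differ from the printed percentages by a fraction of a point.

```rust
/// Relative change from `before` to `after`, in percent.
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// (label, before MiB/s, after MiB/s), copied from the throughput table.
pub const ROWS: [(&str, f64, f64); 6] = [
    ("Vera 1T", 3.96, 6.18),
    ("Vera 88T", 17.75, 19.40),
    ("EPYC 9124 1T", 3.84, 5.19),
    ("EPYC 9124 16T", 23.40, 42.72),
    ("M3 Pro 1T", 4.64, 6.90),
    ("M3 Pro 12T", 17.53, 34.60),
];

/// One formatted line per row, for example to print next to the table.
pub fn report() -> Vec<String> {
    ROWS.iter()
        .map(|(label, before, after)| {
            format!("{}: {:+.1} %", label, percent_change(*before, *after))
        })
        .collect()
}
```

Calling `report()` and printing each line gives one recomputed percentage per platform and thread count.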
Perf evidence on Vera (88 physical cores) at 88T:
`perf record -g --call-graph fp -F 4999` on a Python process calling
`tokenizer.encode_batch(docs, false)` in a loop for 15 s (`docs` is
`data/big.txt` split into 999 ~6.5 KB chunks). Wheels were built with
`-Ctarget-feature=+lse,+rcpc`:

| symbol | before | after |
| ------ | ------ | ----- |
| libc `malloc` / `cfree` / `realloc` (combined) | ~1.8 % | ~0.13 % (-13x) |
| mimalloc family (`mi_page_malloc`, `mi_free`, ...) | - | ~0.96 % |
| `std::thread::local::LocalKey<T>::with` | - | 14.25 % |

The libc malloc family drops by an order of magnitude. mimalloc's
fast path costs about half as much (~0.96 %) and brings in ~14 % of
cycles in `LocalKey::with`, the price of its thread-local heap.
# Memory footprint
The throughput win above comes from mimalloc caching freed memory in
thread-local pools rather than returning it immediately to the OS;
the same mechanism increases the steady-state resident size of the
Python process. Measured on Vera, same Python script that ran the
throughput sweep above, RSS read from `/proc/self/status` after 50
calls to `tokenizer.encode_batch(docs, false)`:

| metric | before | after | delta |
| ------ | ------ | ----- | ----- |
| cdylib `.so` on disk | 10.08 MiB | 10.24 MiB | +0.15 MiB |
| process VmRSS (1T) | 237 MiB | 704 MiB | x2.97 |
| process VmRSS (88T) | 720 MiB | 2 035 MiB | x2.83 |
| process VmPeak virtual (88T) | 6.1 GiB | 9.2 GiB | x1.50 |

The on-disk cost is ~155 KiB of embedded libmimalloc. At runtime,
mimalloc holds onto free pages in per-thread heaps for reuse instead
of returning them to the kernel after every `encode_batch` call;
that cache is what makes the next call cheap, and what makes the
steady-state RSS ~3x larger. mimalloc does eventually purge unused
pages back to the OS, but on its own schedule, not synchronously.
This is the explicit trade-off: ~3x physical RSS for +56 % (1T) /
+9 % (88T) on this benchmark. Users for whom that trade is wrong
can build with `--no-default-features` to fall back to the system
allocator.
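The measurements above were taken from a Python script that is not included here. As a rough Rust equivalent of the same read, the sketch below pulls `VmRSS` (or `VmPeak`) out of `/proc/self/status` and computes the growth factor shown in the delta column. All names are illustrative, and the file only exists on Linux, so `current_rss_kib` returns `None` elsewhere.

```rust
use std::fs;

/// Extract a `kB` field such as `VmRSS` or `VmPeak` from the text of
/// `/proc/<pid>/status`. The kernel prints these values in units of 1024 bytes.
pub fn status_field_kib(status: &str, key: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        // Require "<key>:" exactly, so "VmRSS" does not match a longer key.
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse::<u64>().ok()
    })
}

/// Resident set size of the current process in KiB, or `None` when
/// `/proc/self/status` is unavailable (for example on macOS or Windows).
pub fn current_rss_kib() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status_field_kib(&status, "VmRSS")
}

/// Growth factor between two measurements, as in the delta column.
pub fn growth(before: f64, after: f64) -> f64 {
    after / before
}
```

Sampling `current_rss_kib()` once with the system allocator and once with mimalloc, after the same number of `encode_batch` calls, and passing both values to `growth` gives the ratio to compare against the table.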
Tracing the aarch64-unknown-linux-gnu failure to: emitted by libmimalloc-sys@0.1.49's build script. This isn't an aarch64-vs-x86_64 issue; it's the manylinux2014 cross-toolchain that the maturin action pulls in (ghcr.io/rust-cross/manylinux2014-cross:aarch64 and the x86_64 equivalent). Both Linux GNU cross-compile jobs fail for the same reason; musllinux, native macOS, and Windows all pass. manylinux2014 ships gcc that doesn't recognise -Wdate-time (the flag landed in gcc 4.9 but is missing here for unclear reasons). libmimalloc-sys 0.1.49 is the latest on crates.io — there's no upstream fix to bump to. The dep itself is mature; the incompat is purely with this specific cross-toolchain. Two ways forward:
I'd lean (1) since you've already moved the rest of the matrix off manylinux2010. Happy to send a follow-up PR with that workflow change once you say which path you prefer.