Multi-NUMA inference - tips and tricks, ideas, commiseration #19102
DocShotgun
started this conversation in
Ideas
I've recently been having an idea of virtualizing the backend devices, and it might be the right approach to supporting NUMA - i.e. one CPU device/backend per NUMA socket. But it's really low priority for me, and I might get back to it only after I'm done with the Metal virtualization.
@fairydreaming @jukofyork Not sure if y'all have had any different observations or optimizations compared to what I've noted here?
Multi-NUMA inference - tips and tricks, ideas, commiseration
Hello folks, I'd like to share my findings from the past few months of messing with all the different knobs, trying to achieve fast local inference on my dual-socket homelab server. Inviting all multi-NUMA homelab brothers and sisters to chime in with tips, tricks, suggestions, and the joy (pain) of multi-NUMA inference.
First off, I've got a couple of utility scripts to share:
`numactl-bind-socket.sh` - This script binds the following command to a single CPU socket (useful for hardware with more than one NUMA node per socket). Additional options specify whether to bind all cores (including hyperthreads) or physical cores only, and whether to enable memory interleave. For the purposes of llama.cpp, we generally want to interleave if we are using more than one NUMA node; otherwise it doesn't matter. The syntax is:

```
./numactl-bind-socket.sh --socket <id> --mode <physical|all> [--interleave <on|off>] <command> [args...]
```

`disable-numa-balancing.sh` - This script records the current NUMA-balancing state, disables NUMA balancing, then runs any command after it. On exit, it restores the previously recorded NUMA-balancing state. The syntax is:

```
./disable-numa-balancing.sh <command> [args...]
```
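I haven't pasted the scripts themselves, but for anyone rolling their own, the binding wrapper is presumably just a thin layer over numactl. A minimal sketch, assuming a socket whose NUMA nodes are 0 and 1 (node IDs and the physical-core handling are hard-coded here and will differ on your hardware):

```bash
#!/usr/bin/env bash
# Rough sketch of a socket-binding wrapper (illustration only:
# hard-coded values stand in for --socket/--mode/--interleave parsing
# and lscpu-based node discovery).
NODES="0,1"        # NUMA nodes belonging to the chosen socket (e.g. with SNC-2)
INTERLEAVE=on      # interleave memory across those nodes

ARGS=(--cpunodebind="$NODES")
if [[ "$INTERLEAVE" == "on" ]]; then
    ARGS+=(--interleave="$NODES")   # spread allocations across the socket's nodes
else
    ARGS+=(--membind="$NODES")      # just keep memory on the socket
fi
# For --mode physical you would instead pass --physcpubind with a list
# of physical core IDs (e.g. derived from lscpu -p, omitted here).

exec numactl "${ARGS[@]}" "$@"
```

The NUMA-balancing wrapper is even simpler: read /proc/sys/kernel/numa_balancing, write 0, run the command, and restore the saved value in a trap on EXIT.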
These are my findings using a dual 8-channel 5th-gen Xeon setup with a single RTX Pro 6000 Workstation Edition in CPU+GPU MoE inference:

Single socket inference
The best settings here seem to be `--no-mmap`, interleave + `--numa distribute` (if more than one node per socket). This gives fair TG speed and allows fast op offload to the GPU for PP in CPU+GPU setups. Bind to the node most proximal to the main GPU, with NUMA balancing disabled if there is more than one node per socket (see the example command below). On single-node-per-socket setups, this just means isolating to the single node most proximal to the GPU.
Either way we aren't using the hardware to its full potential.
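For concreteness, here's roughly what a single-socket CPU+GPU run looks like with the wrappers above (a sketch, not my exact command line - the model path, thread count, and the `-ngl`/`--n-cpu-moe` split are placeholders to adjust for your model and hardware):

```bash
# Single-socket run: bind to socket 0 (the one closest to the GPU),
# interleave across its nodes, disable NUMA balancing for the duration,
# and load without mmap. Model path and CPU/GPU split are placeholders.
./disable-numa-balancing.sh \
  ./numactl-bind-socket.sh --socket 0 --mode physical --interleave on \
  ./llama-server -m /models/some-moe-model.gguf \
    --no-mmap --numa distribute \
    -t 48 -ngl 99 --n-cpu-moe 60
```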
TODO: What's the difference between `--numa distribute` and `--numa numactl` in this case when binding to 2 nodes?

Dual socket inference
For models that fit within the RAM of a single socket, using both sockets unfortunately tends to be less performant in my experience, due to cross-socket access, even though twice as many cores and memory channels are available. Open to any new tricks anyone is aware of!
The best overall setup seems to be `--no-mmap`, interleave + `--numa distribute` across as many nodes as are needed to fit the model, with NUMA balancing disabled.

`drop_caches` + `--mmap` + `--numa distribute` trick (#16000 (comment)): This is interesting. Basically we drop the page cache, then mmap the model and use the warmup pass to "migrate" tensors to the appropriate node based on threads (see the sketch below). It's able to achieve higher TG speed than single-socket inference, even on models small enough to fit on a single socket. However, it seems to really hurt PP speed during op offload to GPU - perhaps the weights end up positioned awkwardly in memory, so copying them to the GPU for op offload is less efficient?

TODO: Second socket as RPC device? Curious if anyone has experience with this.
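For reference, the trick boils down to a sequence like this (a sketch based on my reading of #16000; the llama-server arguments besides `--numa distribute` are placeholders):

```bash
# Flush the page cache so the model isn't already resident on one node,
# then load via mmap (the default - just don't pass --no-mmap) with
# --numa distribute. The warmup pass then faults each tensor's pages in
# on the NUMA node of the thread that first touches it.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null

./disable-numa-balancing.sh \
  ./llama-server -m /models/some-moe-model.gguf \
    --numa distribute -t 96 -ngl 99 --n-cpu-moe 60
```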
Repack buffers (i.e. AMX)
I haven't been able to find a good use for these in the CPU+GPU setting, but I suspect they are useful for CPU-only inference. With `--no-host`, we can enable repack buffers for use in conjunction with the GPU. However, this disables disaggregated op offload for PP, since there are no host buffers. Generally this doesn't seem worthwhile, because GPU PP is still faster than repack CPU PP.
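To make the trade-off concrete, trying repack buffers alongside the GPU is just a matter of adding the flag (again a sketch - model path and CPU/GPU split are placeholders):

```bash
# Hybrid run with host buffers disabled so repacked (e.g. AMX) CPU
# buffers are used for the CPU-resident weights; the cost, as noted
# above, is losing disaggregated op offload for PP.
./llama-server -m /models/some-moe-model.gguf \
  --no-host -ngl 99 --n-cpu-moe 60 -t 48
```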
Suggestions

Any observations and tips from your experience are welcome. What future NUMA-specific optimizations would you like to see in llama.cpp?