Multi-NUMA inference - tips and tricks, ideas, commiseration #19102
DocShotgun
started this conversation in
Ideas
I've recently been having an idea of virtualizing the backend devices, and it might be the right approach to supporting NUMA - i.e. one CPU device/backend per NUMA socket. But it's really low priority for me, and I might get back to it only after I'm done with the Metal virtualization.
@fairydreaming @jukofyork Not sure if y'all have had any different observations or optimizations compared to what I've noted here?
Multi-NUMA inference - tips and tricks, ideas, commiseration
Hello folks, I'd like to share my findings from the past few months of messing with all the different knobs, trying to achieve fast local inference on my dual-socket homelab server. Inviting all multi-NUMA homelab brothers and sisters to chime in with tips, tricks, suggestions, and the joy (pain) of multi-NUMA inference.
First off, I've got a couple of utility scripts to share:
`numactl-bind-socket.sh` - This script binds the following command to a single CPU socket (useful for hardware with more than one NUMA node per socket). Additional options specify whether to bind all cores (including hyperthreads) or physical cores only, and whether to enable memory interleave. For the purposes of llama.cpp, we generally want to interleave if we are using more than one NUMA node; otherwise it doesn't matter. The syntax is:

```
./numactl-bind-socket.sh --socket <id> --mode <physical|all> [--interleave <on|off>] <command> [args...]
```

`disable-numa-balancing.sh` - This script records the current NUMA-balancing state, disables NUMA balancing, then runs any command after it. On exit, it restores the previously recorded NUMA-balancing state. The syntax is:

```
./disable-numa-balancing.sh <command> [args...]
```
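I haven't pasted the scripts themselves, but for anyone rolling their own, the binding wrapper is presumably just a thin layer over numactl. A minimal sketch, assuming a socket whose NUMA nodes are 0 and 1 (node IDs and the physical-core handling are hard-coded here and will differ on your hardware):

```bash
#!/usr/bin/env bash
# Rough sketch of a socket-binding wrapper (illustration only:
# hard-coded values stand in for --socket/--mode/--interleave parsing
# and lscpu-based node discovery).
NODES="0,1"        # NUMA nodes belonging to the chosen socket (e.g. with SNC-2)
INTERLEAVE=on      # interleave memory across those nodes

ARGS=(--cpunodebind="$NODES")
if [[ "$INTERLEAVE" == "on" ]]; then
    ARGS+=(--interleave="$NODES")   # spread allocations across the socket's nodes
else
    ARGS+=(--membind="$NODES")      # just keep memory on the socket
fi
# For --mode physical you would instead pass --physcpubind with a list
# of physical core IDs (e.g. derived from lscpu -p, omitted here).

exec numactl "${ARGS[@]}" "$@"
```

The NUMA-balancing wrapper is even simpler: read /proc/sys/kernel/numa_balancing, write 0, run the command, and restore the saved value in a trap on EXIT.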
These are my findings using a dual 8-channel 5th-gen Xeon setup with a single RTX Pro 6000 Workstation Edition in CPU+GPU MoE inference:

Single socket inference
The best settings here seem to be `--no-mmap`, interleave + `--numa distribute` (if more than one node per socket). This gives fair TG speed and allows fast op offload to the GPU for PP in CPU+GPU setups. Bind to the node most proximal to the main GPU, with NUMA balancing disabled if there is more than one node per socket (see the example command below). On single-node-per-socket setups, this just means isolating to the single node most proximal to the GPU.
Either way we aren't using the hardware to its full potential.
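For concreteness, here's roughly what a single-socket CPU+GPU run looks like with the wrappers above (a sketch, not my exact command line - the model path, thread count, and the `-ngl`/`--n-cpu-moe` split are placeholders to adjust for your model and hardware):

```bash
# Single-socket run: bind to socket 0 (the one closest to the GPU),
# interleave across its nodes, disable NUMA balancing for the duration,
# and load without mmap. Model path and CPU/GPU split are placeholders.
./disable-numa-balancing.sh \
  ./numactl-bind-socket.sh --socket 0 --mode physical --interleave on \
  ./llama-server -m /models/some-moe-model.gguf \
    --no-mmap --numa distribute \
    -t 48 -ngl 99 --n-cpu-moe 60
```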
TODO: What's the difference between `--numa distribute` and `--numa numactl` in this case when binding to 2 nodes?

Dual socket inference
For models that fit within the RAM of a single socket, using both sockets unfortunately tends to be less performant in my experience, due to cross-socket access, even though twice as many cores and memory channels are available. Open to any new tricks anyone is aware of!
The best overall setup seems to be `--no-mmap`, interleave + `--numa distribute` across as many nodes as are needed to fit the model, with NUMA balancing disabled.

`drop_caches` + `--mmap` + `--numa distribute` trick (#16000 (comment)): This is interesting. Basically we drop the page cache, then mmap the model and use the warmup pass to "migrate" tensors to the appropriate node based on threads (see the sketch below). It's able to achieve higher TG speed than single-socket inference, even on models small enough to fit on a single socket. However, it seems to really hurt PP speed during op offload to GPU - perhaps the weights end up positioned awkwardly in memory, so copying them to the GPU for op offload is less efficient?

TODO: Second socket as RPC device? Curious if anyone has experience with this.
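For reference, the trick boils down to a sequence like this (a sketch based on my reading of #16000; the llama-server arguments besides `--numa distribute` are placeholders):

```bash
# Flush the page cache so the model isn't already resident on one node,
# then load via mmap (the default - just don't pass --no-mmap) with
# --numa distribute. The warmup pass then faults each tensor's pages in
# on the NUMA node of the thread that first touches it.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null

./disable-numa-balancing.sh \
  ./llama-server -m /models/some-moe-model.gguf \
    --numa distribute -t 96 -ngl 99 --n-cpu-moe 60
```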
Repack buffers (i.e. AMX)
I haven't been able to find a good use for these in the CPU+GPU setting, but I suspect they are useful for CPU-only inference. With `--no-host`, we can enable repack buffers for use in conjunction with the GPU. However, this disables disaggregated op offload for PP, since there are no host buffers. Generally this doesn't seem worthwhile, because GPU PP is still faster than repack CPU PP.
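To make the trade-off concrete, trying repack buffers alongside the GPU is just a matter of adding the flag (again a sketch - model path and CPU/GPU split are placeholders):

```bash
# Hybrid run with host buffers disabled so repacked (e.g. AMX) CPU
# buffers are used for the CPU-resident weights; the cost, as noted
# above, is losing disaggregated op offload for PP.
./llama-server -m /models/some-moe-model.gguf \
  --no-host -ngl 99 --n-cpu-moe 60 -t 48
```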
Suggestions

Any observations and tips from your experience are welcome. What future NUMA-specific optimizations would you like to see in llama.cpp?