llama.cpp on AI Max 395+ with shared memory #18839
Good morning. I've been trying to set up this Framework Linux box properly and found that some of the numbers simply don't make sense. In short, I'm trying to load two 30B Qwen A3B models into memory with separate instances of llama.cpp. Before doing any of this, I configured the system for unified memory, so the full 128 GB is available to the GPU. It's there, and I'm sure it's working, but the reported numbers don't add up.

Loading the first model, a 30B A3B coding model: the backend is Vulkan, and the weird part is the line "shared memory: 65536", which is very odd. This instance is launched when nothing else is running.

I then try to load the second instance (Qwen VL A3B 30B), and here's what I'm seeing: similar information, and shared memory still shows 65536. Projected use is lower than the free space, as expected; it says it will leave ~52 GB after the load. But the model doesn't load, with an eventual:

I'm building llama.cpp from source through a Dockerfile; I have versions for both the Vulkan and non-Vulkan builds. Contents below:

The numbers don't make much sense to me. Is anyone aware of any workarounds? Apparently others have had similar issues, and pull requests have gone in to improve the memory reporting.
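For reference, this is the general shape of how the two instances are launched (the model filenames and ports here are illustrative, not my exact setup, and -ngl 99 just offloads all layers to the GPU):

```sh
# instance 1: coding model
llama-server -m Qwen3-Coder-30B-A3B.gguf -ngl 99 --port 8080 &
# instance 2: vision-language model
llama-server -m Qwen3-VL-30B-A3B.gguf -ngl 99 --port 8081 &
```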
The shared memory number refers to the amount of memory, in bytes, that your GPU can share within a workgroup; it is not related to VRAM. However, the actual available VRAM number is also shown by the server, for example:
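If you want to confirm where that number comes from, it corresponds to the Vulkan device limit maxComputeSharedMemorySize. A quick way to check it, assuming the vulkaninfo tool from vulkan-tools is installed:

```sh
# grep the compute shared-memory limit out of the full device report
vulkaninfo | grep -i maxComputeSharedMemorySize
# on this GPU it reports 65536, i.e. 64 KiB per workgroup
```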
In my troubleshooting, trying to use an AMD APU with unified memory (not GTT), I found that the amdgpu driver relies on the motherboard manufacturer to populate the ACPI CRAT table. If that table is invalid, the driver falls back to a "virtual" population with separate CPU and dGPU nodes rather than an APU node (I don't have a dGPU, but it sets things up that way as a fallback). After that, as far as I can figure, the amdgpu driver won't use unified memory for me, because it never sees an APU in the first place. Sorry, I don't know all the terminology and may have conceptualised things wrong.

Using the Vulkan backend with memory through GTT [1][2] works for me, but I suspect the performance may not be as good as XNACK or the other unified-memory approaches they've experimented with since (out of my depth there). I don't know if your situation is similar. Possible solutions are fixing the ACPI CRAT table, fixing the virtual CRAT table population algorithm, or hoping the motherboard manufacturer fixes their table in a BIOS update.
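If it helps, this is roughly how I check what the driver is actually giving me. These sysfs files are exposed by amdgpu; the card0 index is an assumption and may differ on your system:

```sh
# total/used GTT (system memory the GPU can map) and the VRAM carve-out, in bytes
cat /sys/class/drm/card0/device/mem_info_gtt_total
cat /sys/class/drm/card0/device/mem_info_gtt_used
cat /sys/class/drm/card0/device/mem_info_vram_total
```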
I ran into a similar issue while trying to use the iGPU on my AMD system. I modified the page limit and pool sizes, and I was able to use roughly 110 GB of the 125 GB available for llama.cpp (2-3 GB is reserved for I/O).

```
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt kvm.ignore_msrs=1 amdgpu.gttsize=110592 amdttm.page_pool_size=27648000 amdttm.pages_limit=27648000 ttm.pages_limit=27648000 ttm.page_pool_size=27648000"
```
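A note on units, based on my reading of the module parameters (so treat it as an assumption): amdgpu.gttsize is in MiB, so 110592 MiB is 108 GiB, while the ttm/amdttm pages_limit and page_pool_size values are counted in 4 KiB pages, so 27,648,000 pages x 4 KiB is about 105 GiB. The amdttm.* duplicates cover the case where the DKMS (ROCm) driver stack renames the ttm module. After running update-grub and rebooting, you can check that the limits took effect with something like:

```sh
# paths assume the in-tree ttm module; the DKMS build exposes amdttm instead
cat /sys/module/ttm/parameters/pages_limit
cat /sys/module/ttm/parameters/page_pool_size
```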