llama.cpp on AI Max 395+ with shared memory #18839
Good morning. I've been trying to set up this Framework Linux box properly and found that some of the numbers simply don't make sense. In short, I'm trying to load two 30B Qwen A3B models into memory with separate instances of llama.cpp. Before doing any of this, I configured the system for unified memory, so the full 128 GB is available to the GPU. It's there, and I'm sure it's working, but the reported numbers don't add up.

Loading the first model, a 30B A3B coding model: the backend is Vulkan, and the weird part is the line "shared memory: 65536", which is very odd. This instance is launched when nothing else is running.

I then try to load the second instance (Qwen VL A3B 30B), and here's what I'm seeing: similar information, and shared memory still shows 65536. Projected use is lower than the free space, as expected; it says it will leave ~52 GB after the load. But the model doesn't load, with an eventual:

I'm building llama.cpp from source through a Dockerfile; I have versions for both the Vulkan and non-Vulkan builds. Contents below:

The numbers don't make much sense to me. Is anyone aware of any workarounds? Apparently others have had similar issues, and pull requests have gone in to improve the memory reporting.
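For reference, this is the general shape of how the two instances are launched (the model filenames and ports here are illustrative, not my exact setup, and -ngl 99 just offloads all layers to the GPU):

```sh
# instance 1: coding model
llama-server -m Qwen3-Coder-30B-A3B.gguf -ngl 99 --port 8080 &
# instance 2: vision-language model
llama-server -m Qwen3-VL-30B-A3B.gguf -ngl 99 --port 8081 &
```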
The shared memory number refers to the amount of memory, in bytes, that your GPU can share within a workgroup; it is not related to VRAM. However, the actual available VRAM number is also shown by the server, for example:
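If you want to confirm where that number comes from, it corresponds to the Vulkan device limit maxComputeSharedMemorySize. A quick way to check it, assuming the vulkaninfo tool from vulkan-tools is installed:

```sh
# grep the compute shared-memory limit out of the full device report
vulkaninfo | grep -i maxComputeSharedMemorySize
# on this GPU it reports 65536, i.e. 64 KiB per workgroup
```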
In my troubleshooting, trying to use an AMD APU with unified memory (not GTT), I found that the amdgpu driver relies on the motherboard manufacturer to populate the ACPI CRAT table. If that table is invalid, the driver falls back to a "virtual" population with separate CPU and dGPU nodes rather than an APU node (I don't have a dGPU, but it sets things up that way as a fallback). After that, as far as I can figure, the amdgpu driver won't use unified memory for me, because it never sees an APU in the first place. Sorry, I don't know all the terminology and may have conceptualised things wrong.

Using the Vulkan backend with memory through GTT [1][2] works for me, but I suspect the performance may not be as good as XNACK or the other unified-memory approaches they've experimented with since (out of my depth there). I don't know if your situation is similar. Possible solutions are fixing the ACPI CRAT table, fixing the virtual CRAT table population algorithm, or hoping the motherboard manufacturer fixes their table in a BIOS update.
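If it helps, this is roughly how I check what the driver is actually giving me. These sysfs files are exposed by amdgpu; the card0 index is an assumption and may differ on your system:

```sh
# total/used GTT (system memory the GPU can map) and the VRAM carve-out, in bytes
cat /sys/class/drm/card0/device/mem_info_gtt_total
cat /sys/class/drm/card0/device/mem_info_gtt_used
cat /sys/class/drm/card0/device/mem_info_vram_total
```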
I ran into a similar issue while trying to use the iGPU on my AMD system. I modified the page limit and pool sizes, and I was able to use roughly 110 GB of the 125 GB available for llama.cpp (2-3 GB is reserved for I/O).

```
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt kvm.ignore_msrs=1 amdgpu.gttsize=110592 amdttm.page_pool_size=27648000 amdttm.pages_limit=27648000 ttm.pages_limit=27648000 ttm.page_pool_size=27648000"
```
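A note on units, based on my reading of the module parameters (so treat it as an assumption): amdgpu.gttsize is in MiB, so 110592 MiB is 108 GiB, while the ttm/amdttm pages_limit and page_pool_size values are counted in 4 KiB pages, so 27,648,000 pages x 4 KiB is about 105 GiB. The amdttm.* duplicates cover the case where the DKMS (ROCm) driver stack renames the ttm module. After running update-grub and rebooting, you can check that the limits took effect with something like:

```sh
# paths assume the in-tree ttm module; the DKMS build exposes amdttm instead
cat /sys/module/ttm/parameters/pages_limit
cat /sys/module/ttm/parameters/page_pool_size
```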