Replies: 2 comments 2 replies
Try adding --sleep-idle-seconds 10 to your llama-server command, or sleep-idle-seconds = 10 in your ini file. The loaded model should then unload after 10 seconds of inactivity.
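For instance, it could sit in the global defaults section of the preset file next to your other settings (just a sketch; I'm assuming the option is accepted in the [*] section rather than per model):

```ini
[*]
fit = off
models-max = 1
models-autoload = 1
sleep-idle-seconds = 10
```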
1 reply
I have
I set up llama-server to enable router mode. Everything starts up fine, and the two models in my ini file show up just as expected. I can even send a chat to one and it gets loaded automatically. I have two AMD Pro W6800s, so 64 GB of VRAM total, and I typically run gpt-oss-120b, which takes up most of that VRAM. When it is loaded and I request a second model that is too big to fit in the available VRAM, I just get an error that the model failed to load. I run llama-server via systemd like this:
```
ExecStart=/opt/llama.cpp/llama-server \
    --device ROCm0,ROCm1 \
    --models-preset /opt/llama-server/config.ini \
    --host 0.0.0.0 --port 8001
```
and here is the config.ini:
```ini
[*]
fit = off
models-max = 1
models-autoload = 1

[gpt-oss-20b]
model = /opt/llama-server/gpt-oss-20b-mxfp4.gguf
flash-attn = 1
n-gpu-layers = 999
ctx-size = 32768
jinja = 1

[gpt-oss-120b]
model = /opt/llama-server/gpt-oss-120b-mxfp4-00001-of-00003.gguf
ctx-size = 32768
flash-attn = 1
n-gpu-layers = 999
n-cpu-moe = 10
jinja = 1
```
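For reference, the failure shows up when I send an ordinary chat request that names the other model while gpt-oss-120b is resident, roughly like this (a sketch; I'm assuming the OpenAI-compatible /v1/chat/completions endpoint and that the preset to load is picked from the "model" field, which matches how the first model autoloads for me):

```bash
# Request gpt-oss-20b while gpt-oss-120b is still loaded;
# this is the point where I get the "failed to load" error.
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "hello"}]
      }'
```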
I updated this morning just in case it was a bug, and I am currently running version 7771 on Fedora Linux 43 KDE. The machine has 128 GB of DDR5, and I compiled for both HIP and Vulkan, if that helps. I have not tested Vulkan yet, just HIP/ROCm, on ROCm 6.4.2.
Am I just not understanding something about the config or its capabilities, do I have something set wrong for enabling that functionality, or what?