Replies: 2 comments 2 replies
Try adding --sleep-idle-seconds 10 to your llama-server command, or sleep-idle-seconds = 10 in your ini file. The loaded model should then unload after 10 seconds of inactivity.
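For instance, it could sit in the global defaults section of the preset file next to your other settings (just a sketch; I'm assuming the option is accepted in the [*] section rather than per model):

```ini
[*]
fit = off
models-max = 1
models-autoload = 1
sleep-idle-seconds = 10
```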
1 reply
I have
I set up llama-server to enable router mode. Everything starts up fine, and the two models in my ini file show up just as expected. I can even send a chat to one and it gets loaded automatically. I have two AMD Pro W6800s, so 64 GB of VRAM total, and I typically run gpt-oss-120b, which takes up most of that VRAM. When it is loaded and I request a second model that is too big to fit in the available VRAM, I just get an error that the model failed to load. I run llama-server via systemd like this:
```
ExecStart=/opt/llama.cpp/llama-server \
    --device ROCm0,ROCm1 \
    --models-preset /opt/llama-server/config.ini \
    --host 0.0.0.0 --port 8001
```
and here is the config.ini:
```ini
[*]
fit = off
models-max = 1
models-autoload = 1

[gpt-oss-20b]
model = /opt/llama-server/gpt-oss-20b-mxfp4.gguf
flash-attn = 1
n-gpu-layers = 999
ctx-size = 32768
jinja = 1

[gpt-oss-120b]
model = /opt/llama-server/gpt-oss-120b-mxfp4-00001-of-00003.gguf
ctx-size = 32768
flash-attn = 1
n-gpu-layers = 999
n-cpu-moe = 10
jinja = 1
```
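For reference, the failure shows up when I send an ordinary chat request that names the other model while gpt-oss-120b is resident, roughly like this (a sketch; I'm assuming the OpenAI-compatible /v1/chat/completions endpoint and that the preset to load is picked from the "model" field, which matches how the first model autoloads for me):

```bash
# Request gpt-oss-20b while gpt-oss-120b is still loaded;
# this is the point where I get the "failed to load" error.
curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [{"role": "user", "content": "hello"}]
      }'
```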
I updated this morning just in case it was a bug, and I am currently running version 7771 on Fedora Linux 43 KDE. The machine has 128 GB of DDR5, and I compiled for both HIP and Vulkan, if that helps. I have not tested Vulkan yet, just HIP/ROCm, on ROCm 6.4.2.
Am I just not understanding something about the config or its capabilities, do I have something set wrong for enabling that functionality, or what?