Conversation

@JohannesGaessler
Collaborator

On master, llama_params_fit checks whether an overflowing layer is projected to fit on the current GPU, but there is no check for whether the allocation is also projected to fit on the next GPU that the layer overflows to. This can result in allocating more memory than --fit-target should allow.
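The missing check can be illustrated with a minimal sketch. Note this is hypothetical code, not the actual llama_params_fit implementation; the struct, function names, and budget model are assumptions made for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the fixed logic; names and structure are
// illustrative and do not mirror the actual llama.cpp code.
struct gpu_budget {
    int64_t free_bytes; // memory still available under --fit-target
};

// Returns true only if the portion of the layer placed on the current
// GPU fits AND the spilled remainder is projected to fit on the next GPU.
bool layer_fits(const std::vector<gpu_budget> & gpus, size_t cur,
                int64_t layer_bytes) {
    const int64_t on_cur = std::min(layer_bytes, gpus[cur].free_bytes);
    const int64_t spill  = layer_bytes - on_cur;
    if (spill == 0) {
        return true; // fits entirely on the current GPU
    }
    // The bug on master: the spill onto the next GPU was never checked,
    // so the allocation there could exceed the fit target.
    return cur + 1 < gpus.size() && spill <= gpus[cur + 1].free_bytes;
}
```

In this toy model, the second return condition is the check the PR adds: without it, any spill is silently assumed to fit on the next device.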

@JohannesGaessler JohannesGaessler merged commit a4bf358 into ggml-org:master Dec 27, 2025
70 of 71 checks passed
@Panchovix

Panchovix commented Dec 28, 2025

Sorry to bother here in a PR, but does it take into account an increase in ubatch size?

I.e. with:

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.7-UD-Q4_K_XL.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -mg 0 \
  -ub 2048 -b 2048 -fitc 32768

This loads on my setup, but it OOMs on CUDA 4 when generating the first message. At the stock ub (512) it works fine.

While with this one:

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.7-UD-Q4_K_XL.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn.=CUDA0" \
  -ot "blk.(15|16|17|18|19|20|21|22|23|24|25|26).ffn.=CUDA1" \
  -ot "blk.(27|28|29|30|31|32|33|34|35).ffn.=CUDA2" \
  -ot "blk.(36|37|38|39|40|41|42|43|44).ffn.=CUDA3" \
  -ot "blk.(45|46|47|48|49|50|51|52|53).ffn.=CUDA4" \
  -ot "blk.(54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73).ffn.=CUDA5" \
  -ot "blk.(74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA6" \
  -mg 0 \
  -ub 2048 -b 2048

It works fine (it's almost the same split as the one produced by the fit logic, but with one layer fewer on CUDA 4).

@JohannesGaessler
Collaborator Author

The physical batch size is taken into account. However, depending on the hardware, backend, and compilation flags you're using, a higher physical batch size will result in larger runtime allocations by the ggml backend. As of right now there is no way to determine the size of these allocations ahead of time, so you may need to specify a larger value for -fitt to go along with a higher value for -ub.
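As a rough sketch of that suggestion: the command below pairs the raised -ub with a looser fit target. The placeholder value and the exact -fitt syntax/unit are assumptions, not stated in this thread; verify with llama-server --help before use.

```shell
# Illustrative config fragment only: give the ggml backend's runtime
# allocations extra headroom when raising the physical batch size.
# <larger-than-default> is a placeholder; the -fitt unit is an assumption.
./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.7-UD-Q4_K_XL.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -mg 0 \
  -ub 2048 -b 2048 \
  -fitc 32768 -fitt <larger-than-default>
```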
