Conversation

@JohannesGaessler
Collaborator

On master, llama_params_fit checks whether an overflowing layer is projected to fit on the current GPU, but there is no check for whether the allocation is also projected to fit on the next GPU that the layer overflows to. This can result in allocating more memory than --fit-target should allow.
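The missing check can be illustrated with a minimal sketch. Note this is hypothetical code, not the actual llama_params_fit implementation; the struct, function names, and budget model are assumptions made for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the fixed logic; names and structure are
// illustrative and do not mirror the actual llama.cpp code.
struct gpu_budget {
    int64_t free_bytes; // memory still available under --fit-target
};

// Returns true only if the portion of the layer placed on the current
// GPU fits AND the spilled remainder is projected to fit on the next GPU.
bool layer_fits(const std::vector<gpu_budget> & gpus, size_t cur,
                int64_t layer_bytes) {
    const int64_t on_cur = std::min(layer_bytes, gpus[cur].free_bytes);
    const int64_t spill  = layer_bytes - on_cur;
    if (spill == 0) {
        return true; // fits entirely on the current GPU
    }
    // The bug on master: the spill onto the next GPU was never checked,
    // so the allocation there could exceed the fit target.
    return cur + 1 < gpus.size() && spill <= gpus[cur + 1].free_bytes;
}
```

In this toy model, the second return condition is the check the PR adds: without it, any spill is silently assumed to fit on the next device.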

@JohannesGaessler JohannesGaessler merged commit a4bf358 into ggml-org:master Dec 27, 2025
70 of 71 checks passed
@Panchovix

Panchovix commented Dec 28, 2025

Sorry to bother here in a PR, but does it take into account an increase in ubatch size?

I.e. with:

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.7-UD-Q4_K_XL.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -mg 0 \
  -ub 2048 -b 2048 -fitc 32768

This loads on my setup, but it OOMs on CUDA 4 when generating the first message. At the stock ub (512) it works fine.

While with this one:

./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.7-UD-Q4_K_XL.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14).ffn.=CUDA0" \
  -ot "blk.(15|16|17|18|19|20|21|22|23|24|25|26).ffn.=CUDA1" \
  -ot "blk.(27|28|29|30|31|32|33|34|35).ffn.=CUDA2" \
  -ot "blk.(36|37|38|39|40|41|42|43|44).ffn.=CUDA3" \
  -ot "blk.(45|46|47|48|49|50|51|52|53).ffn.=CUDA4" \
  -ot "blk.(54|55|56|57|58|59|60|61|62|63|64|65|66|67|68|69|70|71|72|73).ffn.=CUDA5" \
  -ot "blk.(74|75|76|77|78|79|80|81|82|83|84|85|86|87|88|89|90|91|92).ffn.=CUDA6" \
  -mg 0 \
  -ub 2048 -b 2048

It works fine (it's almost the same split as the one produced by the fit logic, but with one layer fewer on CUDA 4).

@JohannesGaessler
Collaborator Author

The physical batch size is taken into account. However, depending on the hardware, backend, and compilation flags you're using, a higher physical batch size will result in larger runtime allocations by the ggml backend. As of right now there is no way to determine the size of these allocations ahead of time, so you may need to specify a larger value for -fitt to go along with a higher value for -ub.
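As a rough sketch of that suggestion: the command below pairs the raised -ub with a looser fit target. The placeholder value and the exact -fitt syntax/unit are assumptions, not stated in this thread; verify with llama-server --help before use.

```shell
# Illustrative config fragment only: give the ggml backend's runtime
# allocations extra headroom when raising the physical batch size.
# <larger-than-default> is a placeholder; the -fitt unit is an assumption.
./llama-server \
  -m '/run/media/pancho/MyDrive/models_llm_2tb/GLM-4.7-UD-Q4_K_XL.gguf' \
  -c 32768 \
  --no-mmap \
  -ngl 999 \
  -mg 0 \
  -ub 2048 -b 2048 \
  -fitc 32768 -fitt <larger-than-default>
```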
