Replies: 10 comments 37 replies
-
I think we also need to look into minimizing (however possible, without compromising test coverage) the number of builds we run.
-
How do we feel about disabling all automatic GitHub-hosted workflows for Pull Requests and delegating it to maintainers to manually decide which workflows to run, and when, for a given PR? The …
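As a rough sketch of what that could look like (workflow and input names here are made up, not the repo's actual ones), the automatic `pull_request` trigger would be replaced with a manual `workflow_dispatch` that a maintainer fires for a specific PR:

```yaml
# Hypothetical sketch: replace automatic PR triggers with manual dispatch,
# so maintainers decide when a workflow runs for a given PR.
name: build
on:
  workflow_dispatch:          # triggered from the Actions tab or via `gh workflow run`
    inputs:
      pr:
        description: "PR number to check out and test"
        required: true
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # gh can check out a PR branch by number once the repo is cloned
      - run: gh pr checkout ${{ inputs.pr }}
        env:
          GH_TOKEN: ${{ github.token }}
```

A maintainer would then run it with something like `gh workflow run build --field pr=12345`.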
-
I wonder if our main issue is that we have many long-running …
-
Well, I was the one who set up the … We also have way too many jobs in general. Aside from removing jobs, you can also try spreading the load between ARM and x86 machines where possible; some things like those cross-compiles or WebGPU runs can probably be done on ARM. There's also the new ubuntu-slim machine, which they hopefully have more of, and which we can use for simple jobs that only need 1 core and 5 GB of memory.
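A sketch of what spreading the load could look like (runner labels are assumptions based on GitHub's current hosted offerings; `ubuntu-slim` in particular may have a different label):

```yaml
# Hypothetical sketch: spread CPU-only jobs across runner types
# instead of putting everything on the default x86 pool.
jobs:
  cpu-build:
    strategy:
      matrix:
        include:
          - runner: ubuntu-latest      # x86-64: heavier builds
          - runner: ubuntu-24.04-arm   # ARM: cross-compile / webgpu candidates
          - runner: ubuntu-slim        # small jobs: lint, docs, scripts (assumed label)
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - run: cmake -B build && cmake --build build -j
```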
-
I can set up a dedicated Podman container with GPU access that starts automatically with my AI server (Ryzen 9 9950X3D, 96 GB DDR5). It won't interfere with my other workloads and can run CUDA and Vulkan workflows at full speed on a real GPU (RTX PRO 6000). It would be a clean pod with a minimal Debian Containerfile / YAML, with the latest CUDA/Vulkan, that anyone in our group could download and instantiate to run the pipeline.
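For reference, a minimal sketch of such a Containerfile, assuming an NVIDIA CUDA base image and Podman with NVIDIA's CDI support on the host (image tag and package set are placeholders, not a tested configuration):

```dockerfile
# Hypothetical Containerfile sketch: pod with CUDA + Vulkan tooling
# for running CI workloads on a real GPU.
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        git cmake build-essential libvulkan-dev vulkan-tools \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /ci
# Placeholder entrypoint; the actual runner registration or CI
# invocation would go here.
CMD ["bash"]
```

It could then be started with GPU access via CDI, e.g. `podman run --device nvidia.com/gpu=all …`.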
-
I have a dedicated server with an AMD Ryzen 7 2700X (8C/16T), 32 GB DDR4, 4 TB of storage, and an NVIDIA RTX 2060 GPU that is currently doing nothing. I believe it can run both CUDA and Vulkan CI workloads. Let me know if this configuration is feasible and we can onboard the server as one of the self-hosted runners.
-
What needs to be updated is this: … Not sure if this will break anything though. This is not yet done upstream, so there's no use syncing our fork yet either.
-
@ggerganov |
-
I guess I am currently on my way to making this worse with #20430, and I have quite a few more checks I would like to add to this workflow in the future. I would be happy to have it restricted to running only when I press a button on PRs that affect the HIP backend.
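One way to get that "press a button" behavior (a sketch only; the label name is an invention for illustration) is to gate the workflow on a PR label, so it runs only when a maintainer applies it:

```yaml
# Hypothetical sketch: run the HIP checks only when a maintainer
# applies a label (name made up) to the PR.
name: hip-checks
on:
  pull_request:
    types: [labeled]
jobs:
  rocm:
    if: ${{ github.event.label.name == 'run-hip-ci' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... the actual HIP build/test steps would go here
```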


-
Overview
With the current trend of GitHub Actions runners becoming increasingly slow, we are close to not having a usable CI pipeline. Opening this discussion to figure out what we can do about it.
(chart: queue time for the last 6 months)
TODOs
- Run the `riscv` workflows only on the `master` branch and not in PRs
- `.yml` files
- `ccache-action`: ci : discuss optimization strategies #20446 (reply in thread)
- Move the `ggml-ci-*-cpu-*` workflows to self-hosted runners to reduce some of the GH cache usage

Upcoming self-hosted runners
One of the best ways to minimize queue times is to move the workflows to self-hosted runners dedicated to running ggml workflows. This requires provisioning and hosting hardware.
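Once such runners are registered, pointing a workflow at them is mostly a matter of matching labels. A sketch (the `ggml-ci` label is an assumption, and the `ci/run.sh` invocation follows the pattern used by the existing ggml CI scripts):

```yaml
# Hypothetical sketch: target a dedicated self-hosted runner by label
# instead of the shared GitHub-hosted pool.
jobs:
  ggml-ci-cpu:
    runs-on: [self-hosted, linux, ggml-ci]
    steps:
      - uses: actions/checkout@v4
      - run: bash ./ci/run.sh ./tmp/results ./tmp/mnt
```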