Why is the prompt cache (context checkpoints) for Gemma 4 so fat? #21480
Dampfinchen started this conversation in General
Replies: 2 comments
`--cache-ram 0 --ctx-checkpoints 1`
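For context, these flags go on the server command line. A sketch, assuming a `llama-server` invocation; the model filename is a placeholder:

```shell
# Hypothetical invocation; the two flags are taken verbatim from the
# suggestion above, the model path is made up for illustration.
llama-server -m gemma-4-26b-a4b.gguf --cache-ram 0 --ctx-checkpoints 1
```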
It is expected: the Qwen3.5 attention mechanism is much more memory-efficient than Gemma 4's thanks to its recurrent state. As @Offset0x suggested, you have plenty of options for choosing how much memory to use and how often to save the Gemma 4 checkpoints, so you can fit your use case.
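The memory argument above can be sketched numerically: a full-attention KV cache grows linearly with the number of cached tokens, while a recurrent state has a fixed size regardless of context length. A toy illustration (all dimensions here are made up, not the real Gemma 4 or Qwen configs):

```python
# Toy model of checkpoint memory: full attention vs. recurrent state.
# All dimensions are illustrative, NOT the actual model configurations.

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=1):
    # K and V tensors for every cached token in every layer
    # (q8_0 is roughly 1 byte per element, hence bytes_per=1).
    return n_layers * 2 * n_tokens * n_kv_heads * head_dim * bytes_per

def recurrent_state_bytes(n_layers=32, state_dim=4096, bytes_per=1):
    # A fixed-size state per layer, independent of how many tokens
    # have been processed.
    return n_layers * state_dim * bytes_per

for n in (1024, 8192, 65536):
    print(n, kv_cache_bytes(n), recurrent_state_bytes())
```

The KV cache doubles when the token count doubles; the recurrent state does not change, which is why checkpoints for a recurrent-style architecture stay small at long context.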
I didn't want to open an issue since I'm not sure whether this is normal behavior, but I've noticed that Gemma 4 26B A4B uses so much RAM for its prompt cache that it quickly becomes unusable at higher context on my 32 GB RAM system.

Qwen 3 35B A3B:

```
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 32 (pos_min = 8191, pos_max = 8191, n_tokens = 8192, size = 62.813 MiB)
```

As you can see, it needs only about 63 MiB per checkpoint, so the pressure on RAM is low.

Gemma 4, however:

```
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 32 (pos_min = 3072, pos_max = 8191, n_tokens = 8192, size = 531.309 MiB)
```

That is a huge difference: nearly 9x the RAM per checkpoint. Is this normal behavior? Both runs used a q8_0 KV cache (if it matters at all). I'm aware these are different architectures, but I'm not sure the context checkpoints themselves should differ that much.