Why is the prompt cache (context checkpoints) for Gemma 4 so fat? #21480
Dampfinchen started this conversation in General
Replies: 2 comments
`--cache-ram 0 --ctx-checkpoints 1`
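For context, these flags go on the server command line. A sketch, assuming a `llama-server` invocation; the model filename is a placeholder:

```shell
# Hypothetical invocation; the two flags are taken verbatim from the
# suggestion above, the model path is made up for illustration.
llama-server -m gemma-4-26b-a4b.gguf --cache-ram 0 --ctx-checkpoints 1
```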
It is expected: the Qwen3.5 attention mechanism is much more memory-efficient than Gemma 4's thanks to its recurrent state. As @Offset0x suggested, you have plenty of options for choosing how much memory to use and how often to save the Gemma 4 checkpoints, so you can fit your use case.
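The memory argument above can be sketched numerically: a full-attention KV cache grows linearly with the number of cached tokens, while a recurrent state has a fixed size regardless of context length. A toy illustration (all dimensions here are made up, not the real Gemma 4 or Qwen configs):

```python
# Toy model of checkpoint memory: full attention vs. recurrent state.
# All dimensions are illustrative, NOT the actual model configurations.

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=1):
    # K and V tensors for every cached token in every layer
    # (q8_0 is roughly 1 byte per element, hence bytes_per=1).
    return n_layers * 2 * n_tokens * n_kv_heads * head_dim * bytes_per

def recurrent_state_bytes(n_layers=32, state_dim=4096, bytes_per=1):
    # A fixed-size state per layer, independent of how many tokens
    # have been processed.
    return n_layers * state_dim * bytes_per

for n in (1024, 8192, 65536):
    print(n, kv_cache_bytes(n), recurrent_state_bytes())
```

The KV cache doubles when the token count doubles; the recurrent state does not change, which is why checkpoints for a recurrent-style architecture stay small at long context.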
I didn't want to open an issue since I'm not sure whether this is normal behavior, but I've noticed that Gemma 4 26B A4B uses so much RAM for its prompt cache that it quickly becomes unusable at higher context on my 32 GB RAM system.

Qwen 3 35B A3B:

```
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 32 (pos_min = 8191, pos_max = 8191, n_tokens = 8192, size = 62.813 MiB)
```

As you can see, it needs only about 63 MiB per checkpoint, so the pressure on RAM is low.

Gemma 4, however:

```
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 32 (pos_min = 3072, pos_max = 8191, n_tokens = 8192, size = 531.309 MiB)
```

That is a huge difference: nearly 9x the RAM per checkpoint. Is this normal behavior? Both runs used a q8_0 KV cache (if it matters at all). I'm aware these are different architectures, but I'm not sure the context checkpoints themselves should differ that much.