transfer KV cache from higher quant to lower quant #21601
okuvshynov started this conversation in General
Let's consider the following setup:
Because we process the prompt just once and can use a large batch, we could:
I did a quick experiment to check whether it would actually help. I didn't implement any chunked loading; I just tested what happens if I transfer the KV cache (code). The setup is based on three inference runs:
We then measure the KLD of the logits for the tokens produced by ref, and compare the target KLD against the handoff KLD.
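As a rough sketch of what this measurement looks like (this is my reading of the setup, not the actual code from the linked experiment), the per-token KLD between two runs can be computed from their raw logits like this:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kld_per_token(ref_logits, test_logits):
    """KL(ref || test) at each token position.

    ref_logits, test_logits: float arrays of shape (n_tokens, vocab_size),
    e.g. the logits from the BF16 reference run and from the target or
    handoff run over the same token sequence.
    """
    ref_logp = log_softmax(ref_logits)
    test_logp = log_softmax(test_logits)
    p = np.exp(ref_logp)
    return (p * (ref_logp - test_logp)).sum(axis=-1)
```

Running this once with the target run's logits and once with the handoff run's logits gives the two KLD series being compared.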
The single most interesting chart is probably this:
This shows KLD aggregated over all prompts, by token position. The solid line represents the 'target' (the lower-quant baseline); the dashed line represents the run with the KV cache transferred from BF16.
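A minimal sketch of the aggregation behind such a chart (assuming prompts of differing lengths, with shorter prompts simply excluded from the mean at positions they don't reach; the helper name is mine, not from the experiment):

```python
import numpy as np

def mean_kld_by_position(kld_per_prompt):
    """Average per-token KLD across prompts at each position.

    kld_per_prompt: list of 1-D arrays, one per prompt; lengths may differ.
    Returns a 1-D array of length max(prompt lengths), where entry i is the
    mean KLD at position i over all prompts long enough to have that position.
    """
    max_len = max(len(k) for k in kld_per_prompt)
    sums = np.zeros(max_len)
    counts = np.zeros(max_len)
    for k in kld_per_prompt:
        sums[: len(k)] += k
        counts[: len(k)] += 1
    return sums / counts
```

Plotting this curve once for the target run and once for the handoff run reproduces the solid-vs-dashed comparison described above.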
As we can see:
I've published more data/charts for the Qwen 3.5 and Gemma 4 model families here. All models show a similar trend, but the medium-sized MoE models (gemma4-26b-a4b and qwen3.5-35b-a3b) show the strongest effect, and the large Qwen 397B the weakest.
Question: are there any benchmarks sensitive enough to show the difference between quants of medium-sized models? KLD is not a perfect proxy for quality, so I wonder if there's something I could run to get more reliable quality measurements.