transfer KV cache from higher quant to lower quant #21601
okuvshynov started this conversation in General
Let's consider the following setup:
Because we process the prompt just once and can use a large batch, we could:
I did a quick experiment to check whether it would actually help. I didn't implement any chunked loading; I just tested what happens if I transfer the KV cache (code). The setup is based on three inference runs:
We then measure the KLD of the logits for the tokens produced by ref, and compare the target KLD against the handoff KLD.
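As a rough sketch of what this measurement looks like (this is my reading of the setup, not the actual code from the linked experiment), the per-token KLD between two runs can be computed from their raw logits like this:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kld_per_token(ref_logits, test_logits):
    """KL(ref || test) at each token position.

    ref_logits, test_logits: float arrays of shape (n_tokens, vocab_size),
    e.g. the logits from the BF16 reference run and from the target or
    handoff run over the same token sequence.
    """
    ref_logp = log_softmax(ref_logits)
    test_logp = log_softmax(test_logits)
    p = np.exp(ref_logp)
    return (p * (ref_logp - test_logp)).sum(axis=-1)
```

Running this once with the target run's logits and once with the handoff run's logits gives the two KLD series being compared.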
The single most interesting chart is probably this:
This shows KLD aggregated over all prompts, by token position. The solid line represents the 'target' (the lower-quant baseline); the dashed line represents the run with the KV cache transferred from BF16.
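A minimal sketch of the aggregation behind such a chart (assuming prompts of differing lengths, with shorter prompts simply excluded from the mean at positions they don't reach; the helper name is mine, not from the experiment):

```python
import numpy as np

def mean_kld_by_position(kld_per_prompt):
    """Average per-token KLD across prompts at each position.

    kld_per_prompt: list of 1-D arrays, one per prompt; lengths may differ.
    Returns a 1-D array of length max(prompt lengths), where entry i is the
    mean KLD at position i over all prompts long enough to have that position.
    """
    max_len = max(len(k) for k in kld_per_prompt)
    sums = np.zeros(max_len)
    counts = np.zeros(max_len)
    for k in kld_per_prompt:
        sums[: len(k)] += k
        counts[: len(k)] += 1
    return sums / counts
```

Plotting this curve once for the target run and once for the handoff run reproduces the solid-vs-dashed comparison described above.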
As we can see:
I've published more data/charts for the Qwen 3.5 and Gemma 4 model families here. All models show a similar trend, but the medium-sized MoE models (gemma4-26b-a4b and qwen3.5-35b-a3b) show the strongest effect, and the large Qwen 397B the weakest.
Question: are there any benchmarks sensitive enough to show the difference between quants of medium-sized models? KLD is not a perfect proxy for quality, so I wonder if there's something I could run to get more reliable quality measurements.