TurboQuant-MoE: 8.5x KV compression with 100% recall on 128k context #21135
RemizovDenis started this conversation in Ideas
Hi everyone,
We’ve spent the last 12 hours building a production-ready implementation of Google DeepMind's TurboQuant (arXiv:2504.19874). On top of the original paper, we added several custom extensions specifically for Mixture-of-Experts (MoE) models.
Benchmark Results (synthetic, reproducible):
For context: KIVI (current SOTA) achieves ~5x real compression at 94% recall on 104k. The original authors suggested that >8x at 100% recall was practically unachievable.
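To put the headline numbers in perspective, here is a back-of-envelope sizing of the fp16 KV cache at 128k context. The model dimensions below (32 layers, 8 KV heads via GQA, head dim 128) are our assumptions based on the published Mixtral-8x7B config, not figures from this post:

```python
# Back-of-envelope KV-cache sizing (assumed Mixtral-8x7B config values).
layers, kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2                      # fp16 = 2 bytes per element
tokens = 128 * 1024                 # 128k context

# K and V each store layers * kv_heads * head_dim values per token.
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # bytes per token
cache_gib = per_token * tokens / 2**30

print(f"fp16 KV cache at 128k tokens: {cache_gib:.1f} GiB")        # 16.0 GiB
print(f"at 8.5x compression:          {cache_gib / 8.5:.1f} GiB")  # ~1.9 GiB
```

Under these assumptions, 8.5x compression takes the 128k-token cache from 16 GiB down to under 2 GiB, which is what makes a single-GPU deployment plausible.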
What we added on top of the Google paper:
Integration:

```python
import torch
from transformers import AutoModelForCausalLM
from turboquant import patch_moe_model, auto_config

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.float16
)
model = patch_moe_model(model, auto_config(model))
```

Same API, ~8x less GPU RAM.
MIT License. Full source available on GitHub.
Search: RemizovDenis/turboquant