TurboQuant-MoE: 8.5x KV compression with 100% recall on 128k context #21135
RemizovDenis started this conversation in Ideas
Hi everyone,
We’ve spent the last 12 hours building a production-ready implementation of Google DeepMind's TurboQuant (arXiv:2504.19874). On top of the original paper, we added several custom extensions specifically for Mixture-of-Experts (MoE) models.
Benchmark Results (synthetic, reproducible):
For context: KIVI (current SOTA) achieves ~5x real compression at 94% recall on 104k. The original authors suggested that >8x at 100% recall was practically unachievable.
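To put the headline numbers in perspective, here is a back-of-envelope sizing of the fp16 KV cache at 128k context. The model dimensions below (32 layers, 8 KV heads via GQA, head dim 128) are our assumptions based on the published Mixtral-8x7B config, not figures from this post:

```python
# Back-of-envelope KV-cache sizing (assumed Mixtral-8x7B config values).
layers, kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2                      # fp16 = 2 bytes per element
tokens = 128 * 1024                 # 128k context

# K and V each store layers * kv_heads * head_dim values per token.
per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # bytes per token
cache_gib = per_token * tokens / 2**30

print(f"fp16 KV cache at 128k tokens: {cache_gib:.1f} GiB")        # 16.0 GiB
print(f"at 8.5x compression:          {cache_gib / 8.5:.1f} GiB")  # ~1.9 GiB
```

Under these assumptions, 8.5x compression takes the 128k-token cache from 16 GiB down to under 2 GiB, which is what makes a single-GPU deployment plausible.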
What we added on top of the Google paper:
Integration:

```python
import torch
from transformers import AutoModelForCausalLM
from turboquant import patch_moe_model, auto_config

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1", torch_dtype=torch.float16
)
model = patch_moe_model(model, auto_config(model))
```

Same API, ~8x less GPU RAM.
MIT License. Full source available on GitHub.
Search: RemizovDenis/turboquant