TurboQuant-MoE is a KV-cache compression and dynamic MoE expert management engine for LLM inference.
Long-context and MoE inference are usually constrained by VRAM capacity and memory bandwidth. TurboQuant-MoE targets this bottleneck with:
- 1/2/3-bit Polar quantization for KV tensors
- QJL residual correction for fidelity preservation
- Cross-layer KV sharing and delta-based compression
- MoE expert cache and prefetch primitives
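To give a feel for what low-bit KV quantization means in practice, here is a minimal sketch of uniform min-max quantization to 1/2/3-bit codes. This is illustrative only: `quantize_lowbit` is a hypothetical helper, not TurboQuant's Polar quantizer, and it omits the QJL residual-correction step entirely.

```python
import numpy as np

def quantize_lowbit(x: np.ndarray, bits: int = 3):
    """Uniformly quantize a tensor to `bits` bits per value.

    Hypothetical sketch; TurboQuant's actual Polar quantizer and
    QJL residual correction are more sophisticated than this.
    """
    levels = 2 ** bits
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (levels - 1)
    # Map each value to an integer code in [0, levels - 1].
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_lowbit(q: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Reconstruct an approximation from the integer codes."""
    return q.astype(np.float32) * scale + lo

# Example: quantize a fake key tensor shaped (num_heads, head_dim).
keys = np.random.randn(32, 128).astype(np.float32)
q, lo, scale = quantize_lowbit(keys, bits=3)
recon = dequantize_lowbit(q, lo, scale)
err = np.abs(keys - recon).max()  # bounded by scale / 2
```

The worst-case reconstruction error of this scheme is half a quantization step, which is exactly the fidelity gap that a residual-correction pass (such as QJL in this project) is meant to close.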
The project is currently distributed as source. Clone the repository and install it in editable mode:

```bash
git clone https://github.com/RemizovDenis/turboquant.git
cd turboquant
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e ".[dev,transformers,benchmark]"
```

Quick start:

```python
from turboquant.core.turboquant import TurboQuantKVCache, TurboQuantConfig

config = TurboQuantConfig(
    head_dim=128,
    num_heads=32,
    bits=3,
    residual_correction=True,
)
cache = TurboQuantKVCache(config)
compressed = cache.compress(keys=key_tensor, values=val_tensor)
recon_k, recon_v = cache.decompress(compressed)
```

Development checks:

```bash
ruff check turboquant tests
ruff format --check turboquant tests
mypy turboquant --strict
pytest tests/ -v --tb=short -x -k "not gpu and not cuda and not triton"
```

Integrations:
- HuggingFace Transformers (`TurboQuantCache`)
- Vector databases (Qdrant, ChromaDB, NumPy adapter)
- Ollama/vLLM integration helpers
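As a back-of-the-envelope illustration of why low-bit KV storage matters, the arithmetic below compares an fp16 KV cache against packed 3-bit codes for one layer. The shapes are example values, and the calculation ignores quantization metadata and residual-correction overhead, so real-world ratios will be somewhat lower.

```python
# KV cache size for one layer (keys + values), fp16 baseline vs 3-bit codes.
num_heads, head_dim, seq_len = 32, 128, 8192
elems = 2 * num_heads * head_dim * seq_len  # keys + values
fp16_bytes = elems * 2                      # 16 bits per value
bits = 3
quant_bytes = elems * bits / 8              # packed 3-bit codes
ratio = fp16_bytes / quant_bytes            # 16/3, roughly 5.3x smaller
```

At 1-bit the same arithmetic gives a 16x reduction, which is why the bit width is the dominant knob in the `TurboQuantConfig` example above.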
Business Source License 1.1 (BUSL-1.1). Commercial use requires a commercial license. Converts to Apache-2.0 on 2030-04-01.