TriAttention KV Cache Pruning — Early llama.cpp Prototype #21619
domvox started this conversation in Show and tell
I've been experimenting with a llama.cpp port of TriAttention (KV cache pruning via trigonometric scoring). This is a research prototype, not production-ready, but the pipeline works end-to-end and includes GPU-side KV compaction after pruning.
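To make the compaction step concrete, here is a hedged CPU-side sketch (not the prototype's actual code, which runs on the GPU): after pruning, the surviving K/V rows are moved into a contiguous prefix so that Flash Attention sees a shorter effective cache. The buffer layout and function name are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// kv is a flat [n_tokens x dim] buffer; kept holds the surviving token
// indices in ascending order. Moves kept rows to the front, shrinks the
// buffer, and returns the new token count.
size_t compact_kv(std::vector<float>& kv, size_t dim,
                  const std::vector<size_t>& kept) {
    size_t dst = 0;
    for (size_t src : kept) {
        if (src != dst) {
            std::copy(kv.begin() + src * dim,
                      kv.begin() + (src + 1) * dim,
                      kv.begin() + dst * dim);
        }
        ++dst;
    }
    kv.resize(dst * dim);
    return dst;
}
```

Because kept indices are ascending, each copy moves a row backwards (src >= dst), so the in-place copies never overwrite rows that are still needed.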
TriAttention scores KV cache tokens using offline-calibrated pre-RoPE Q/K statistics, then evicts low-importance entries periodically. Unlike attention-based methods such as H2O or SnapKV, it does not need to compute full attention weights to decide what to keep.
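The eviction policy described above can be sketched roughly as follows. This is a minimal illustration, not TriAttention itself: the per-token scores are assumed to come from the offline-calibrated pre-RoPE Q/K statistics (the trigonometric scoring is omitted), a recent window is always retained, and older tokens compete on score for the remaining budget. All names and parameters here are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// scores[i]: importance score for cached token i (assumed precomputed).
// budget_pct: fraction of the cache to retain (e.g. 50).
// window: number of most recent tokens that are never evicted.
// Returns the kept token indices in ascending order.
std::vector<size_t> select_kept_tokens(const std::vector<float>& scores,
                                       int budget_pct, size_t window) {
    size_t n = scores.size();
    size_t keep_total = n * budget_pct / 100;
    size_t recent = std::min(window, n);
    if (keep_total < recent) keep_total = recent;  // window always survives

    // Older tokens (everything before the recent window) compete on score.
    std::vector<size_t> older(n - recent);
    for (size_t i = 0; i < older.size(); ++i) older[i] = i;
    size_t slots = std::min(keep_total - recent, older.size());
    std::partial_sort(older.begin(), older.begin() + slots, older.end(),
                      [&](size_t a, size_t b) { return scores[a] > scores[b]; });
    older.resize(slots);

    std::vector<size_t> kept(older);
    for (size_t i = n - recent; i < n; ++i) kept.push_back(i);
    std::sort(kept.begin(), kept.end());  // restore sequence order
    return kept;
}
```

In a real integration this selection would run every eviction interval, followed by a compaction pass over the K/V buffers.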
Results (Qwen3-8B Q4_K_M, WikiText-2, ctx=4096)
At 50% retention, total wall time already beats the baseline (41.9s vs ~58s over 5 chunks), because Flash Attention runs against a shorter effective cache after pruning. Quality is not yet at the paper's level, but the runtime signal is real.
What's implemented
llama_decode

Current limitations
Code
Branch feature/triattention-scoring on domvox/llama.cpp-turboquant-hip

CLI flags:
--triattention <stats.bin> --tri-budget 50 --tri-window 128 --tri-interval 256

Feedback welcome
Hardware: RX 7900 XTX / gfx1100 / ROCm 6.4 / Ryzen 9 9950X3D