TriAttention KV Cache Pruning — Early llama.cpp Prototype #21619
domvox started this conversation in Show and tell
I've been experimenting with a llama.cpp port of TriAttention (KV cache pruning via trigonometric scoring). This is a research prototype, not production-ready, but the pipeline works end-to-end and includes GPU-side KV compaction after pruning.
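To make the compaction step concrete, here is a hedged CPU-side sketch (not the prototype's actual code, which runs on the GPU): after pruning, the surviving K/V rows are moved into a contiguous prefix so that Flash Attention sees a shorter effective cache. The buffer layout and function name are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// kv is a flat [n_tokens x dim] buffer; kept holds the surviving token
// indices in ascending order. Moves kept rows to the front, shrinks the
// buffer, and returns the new token count.
size_t compact_kv(std::vector<float>& kv, size_t dim,
                  const std::vector<size_t>& kept) {
    size_t dst = 0;
    for (size_t src : kept) {
        if (src != dst) {
            std::copy(kv.begin() + src * dim,
                      kv.begin() + (src + 1) * dim,
                      kv.begin() + dst * dim);
        }
        ++dst;
    }
    kv.resize(dst * dim);
    return dst;
}
```

Because kept indices are ascending, each copy moves a row backwards (src >= dst), so the in-place copies never overwrite rows that are still needed.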
TriAttention scores KV cache tokens using offline-calibrated pre-RoPE Q/K statistics, then evicts low-importance entries periodically. Unlike attention-based methods such as H2O or SnapKV, it does not need to compute full attention weights to decide what to keep.
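The eviction policy described above can be sketched roughly as follows. This is a minimal illustration, not TriAttention itself: the per-token scores are assumed to come from the offline-calibrated pre-RoPE Q/K statistics (the trigonometric scoring is omitted), a recent window is always retained, and older tokens compete on score for the remaining budget. All names and parameters here are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// scores[i]: importance score for cached token i (assumed precomputed).
// budget_pct: fraction of the cache to retain (e.g. 50).
// window: number of most recent tokens that are never evicted.
// Returns the kept token indices in ascending order.
std::vector<size_t> select_kept_tokens(const std::vector<float>& scores,
                                       int budget_pct, size_t window) {
    size_t n = scores.size();
    size_t keep_total = n * budget_pct / 100;
    size_t recent = std::min(window, n);
    if (keep_total < recent) keep_total = recent;  // window always survives

    // Older tokens (everything before the recent window) compete on score.
    std::vector<size_t> older(n - recent);
    for (size_t i = 0; i < older.size(); ++i) older[i] = i;
    size_t slots = std::min(keep_total - recent, older.size());
    std::partial_sort(older.begin(), older.begin() + slots, older.end(),
                      [&](size_t a, size_t b) { return scores[a] > scores[b]; });
    older.resize(slots);

    std::vector<size_t> kept(older);
    for (size_t i = n - recent; i < n; ++i) kept.push_back(i);
    std::sort(kept.begin(), kept.end());  // restore sequence order
    return kept;
}
```

In a real integration this selection would run every eviction interval, followed by a compaction pass over the K/V buffers.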
Results (Qwen3-8B Q4_K_M, WikiText-2, ctx=4096)
At 50% retention, total wall time already beats the baseline (41.9s vs ~58s over 5 chunks), because Flash Attention runs against a shorter effective cache after pruning. Quality is not yet at the paper's level, but the runtime signal is real.
What's implemented
llama_decode

Current limitations
Code
Branch feature/triattention-scoring on domvox/llama.cpp-turboquant-hip

CLI flags:
--triattention <stats.bin> --tri-budget 50 --tri-window 128 --tri-interval 256

Feedback welcome
Hardware: RX 7900 XTX / gfx1100 / ROCm 6.4 / Ryzen 9 9950X3D