ggml-cpu: AVX-512-VNNI dot-products for Q1_0/Q2_0 by bri-prism · Pull Request #37 · PrismML-Eng/llama.cpp

bri-prism · 2026-06-02T20:51:24Z

What

Adds an AVX-512-VNNI / AVX-512VL fast path for the Q1_0 and Q2_0 CPU dot products. Both formats previously had no x86 vec_dot path — arch-fallback routed the _generic functions to a scalar loop. The scalar path is preserved as the #else fallback (guarded by __AVX512VNNI__ && __AVX512VL__).

Why

The low-bit CPU dot product is the hot kernel for Bonsai inference on CPU backends. Q2_0 in particular was leaving a large amount of x86 throughput on the table.

How

helper ggml_hsum_i32_8_vnni to reduce _mm256_dpbusd_epi32 accumulators
Q1_0: build a sign mask from the bit field, blend +qy/-qy, accumulate with dpbusd(ones, sel)
Q2_0: vectorized 2-bit unpack (replicate-4 + 16-bit shift/mask + pack), then dpbusd(codes, qy) - dpbusd(ones, qy) = sum((code-1)*qy)

Performance (EPYC 9655, AVX-512-VNNI)

Format	prefill	decode
Q2_0	~3.9×	~3.0×
Q1_0	~parity (the ±1 scalar loop already auto-vectorizes)

Scales: holds 3.4–4× through -d 2048; scalar Q2_0 times out at -d 8192.

Correctness

Bit-exact vs the scalar reference: standalone unit test (2000+ random blocks) + test-quantize-fns dot-product error identical with/without the path.
KL divergence vs FP16 (Bonsai-8B, packed_models_KL_validation flow): Q2_0 mean KLD 0.000135, top-1 99.28%, PPL +0.14% — passes thresholds. Scalar and VNNI builds produce byte-identical KLD/top-1, confirming the kernel preserves outputs end-to-end.
Verified to build + run on this prism tree (above numbers reproduced here).

Notes

This is the public-fork counterpart of the approved internal PR (llama.cpp-private#1), landed on prism per review. The one red CI signal — test-quantize-fns on q2_0 — is pre-existing on prism (the {−1,0,1,2} 2-bit format exceeds the generic 2-bit threshold); it reports identical numbers with and without this change.

Q1_0/Q2_0 had no x86 vec_dot path (arch-fallback routed the generic functions to a scalar loop). Add an AVX-512-VNNI/AVX-512VL fast path guarded by __AVX512VNNI__ && __AVX512VL__, scalar fallback otherwise: - helper ggml_hsum_i32_8_vnni to reduce _mm256_dpbusd_epi32 accumulators - Q1_0: build a sign mask from the bit field, blend +qy/-qy, accumulate with dpbusd(ones, sel) - Q2_0: vectorized 2-bit unpack (replicate-4 + 16-bit shift/mask + pack), then dpbusd(codes, qy) - dpbusd(ones, qy) = sum((code-1)*qy) Q2_0 prefill ~3.9x / decode ~3.0x vs scalar on EPYC 9655; Q1_0 ~parity (the +/-1 scalar loop already auto-vectorizes). Bit-exact vs scalar (test-quantize-fns + standalone unit test); KL-divergence vs FP16 unchanged between scalar and VNNI builds.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ggml-cpu: AVX-512-VNNI dot-products for Q1_0/Q2_0#37

ggml-cpu: AVX-512-VNNI dot-products for Q1_0/Q2_0#37
bri-prism wants to merge 1 commit into
prismfrom
cpu/avx512-vnni-q1q2-prism

bri-prism commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bri-prism commented Jun 2, 2026

What

Why

How

Performance (EPYC 9655, AVX-512-VNNI)

Correctness

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant