Skip to content

ggml-cpu: AVX-512-VNNI dot-products for Q1_0/Q2_0#37

Open
bri-prism wants to merge 1 commit into
prismfrom
cpu/avx512-vnni-q1q2-prism
Open

ggml-cpu: AVX-512-VNNI dot-products for Q1_0/Q2_0#37
bri-prism wants to merge 1 commit into
prismfrom
cpu/avx512-vnni-q1q2-prism

Conversation

@bri-prism
Copy link
Copy Markdown

What

Adds an AVX-512-VNNI / AVX-512VL fast path for the Q1_0 and Q2_0 CPU dot products. Both formats previously had no x86 vec_dot path — arch-fallback routed the _generic functions to a scalar loop. The scalar path is preserved as the #else fallback (guarded by __AVX512VNNI__ && __AVX512VL__).

Why

The low-bit CPU dot product is the hot kernel for Bonsai inference on CPU backends. Q2_0 in particular was leaving a large amount of x86 throughput on the table.

How

  • helper ggml_hsum_i32_8_vnni to reduce _mm256_dpbusd_epi32 accumulators
  • Q1_0: build a sign mask from the bit field, blend +qy/-qy, accumulate with dpbusd(ones, sel)
  • Q2_0: vectorized 2-bit unpack (replicate-4 + 16-bit shift/mask + pack), then dpbusd(codes, qy) - dpbusd(ones, qy) = sum((code-1)*qy)

Performance (EPYC 9655, AVX-512-VNNI)

Format prefill decode
Q2_0 ~3.9× ~3.0×
Q1_0 ~parity (the ±1 scalar loop already auto-vectorizes)

Scales: holds 3.4–4× through -d 2048; scalar Q2_0 times out at -d 8192.

Correctness

  • Bit-exact vs the scalar reference: standalone unit test (2000+ random blocks) + test-quantize-fns dot-product error identical with/without the path.
  • KL divergence vs FP16 (Bonsai-8B, packed_models_KL_validation flow): Q2_0 mean KLD 0.000135, top-1 99.28%, PPL +0.14% — passes thresholds. Scalar and VNNI builds produce byte-identical KLD/top-1, confirming the kernel preserves outputs end-to-end.
  • Verified to build + run on this prism tree (above numbers reproduced here).

Notes

This is the public-fork counterpart of the approved internal PR (llama.cpp-private#1), landed on prism per review. The one red CI signal — test-quantize-fns on q2_0 — is pre-existing on prism (the {−1,0,1,2} 2-bit format exceeds the generic 2-bit threshold); it reports identical numbers with and without this change.

Q1_0/Q2_0 had no x86 vec_dot path (arch-fallback routed the generic
functions to a scalar loop). Add an AVX-512-VNNI/AVX-512VL fast path
guarded by __AVX512VNNI__ && __AVX512VL__, scalar fallback otherwise:

- helper ggml_hsum_i32_8_vnni to reduce _mm256_dpbusd_epi32 accumulators
- Q1_0: build a sign mask from the bit field, blend +qy/-qy, accumulate
  with dpbusd(ones, sel)
- Q2_0: vectorized 2-bit unpack (replicate-4 + 16-bit shift/mask + pack),
  then dpbusd(codes, qy) - dpbusd(ones, qy) = sum((code-1)*qy)

Q2_0 prefill ~3.9x / decode ~3.0x vs scalar on EPYC 9655; Q1_0 ~parity
(the +/-1 scalar loop already auto-vectorizes). Bit-exact vs scalar
(test-quantize-fns + standalone unit test); KL-divergence vs FP16
unchanged between scalar and VNNI builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant