ggml-cpu: AVX-512-VNNI dot-products for Q1_0/Q2_0 #37
Open
bri-prism wants to merge 1 commit into
Q1_0/Q2_0 had no x86 vec_dot path (arch-fallback routed the generic functions to a scalar loop). Add an AVX-512-VNNI/AVX-512VL fast path guarded by __AVX512VNNI__ && __AVX512VL__, scalar fallback otherwise:

- helper ggml_hsum_i32_8_vnni to reduce _mm256_dpbusd_epi32 accumulators
- Q1_0: build a sign mask from the bit field, blend +qy/-qy, accumulate with dpbusd(ones, sel)
- Q2_0: vectorized 2-bit unpack (replicate-4 + 16-bit shift/mask + pack), then dpbusd(codes, qy) - dpbusd(ones, qy) = sum((code-1)*qy)

Q2_0 prefill ~3.9x / decode ~3.0x vs scalar on EPYC 9655; Q1_0 ~parity (the +/-1 scalar loop already auto-vectorizes). Bit-exact vs scalar (test-quantize-fns + standalone unit test); KL-divergence vs FP16 unchanged between scalar and VNNI builds.
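The Q2_0 offset identity can be checked without VNNI hardware. The sketch below is a scalar model only, not the kernel from this change: `dpbusd_lane` mimics one 32-bit lane of `_mm256_dpbusd_epi32` (four unsigned-byte by signed-byte products added to an accumulator), and the helper names plus the low-bits-first packing of four codes per byte are assumptions made for illustration. The real Q2_0 bit layout may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of _mm256_dpbusd_epi32:
 * acc + sum of four (unsigned byte * signed byte) products. */
static int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Hypothetical unpack: four 2-bit codes per byte, lowest bits first.
 * This layout is an assumption for the sketch. */
static void unpack2_lowfirst(uint8_t packed, uint8_t codes[4]) {
    for (int i = 0; i < 4; ++i) {
        codes[i] = (uint8_t)((packed >> (2 * i)) & 3);
    }
}

/* sum((code - 1) * qy) computed as dpbusd(codes, qy) - dpbusd(ones, qy). */
static int32_t q2_dot4_offset(uint8_t packed, const int8_t qy[4]) {
    static const uint8_t ones[4] = {1, 1, 1, 1};
    uint8_t codes[4];
    unpack2_lowfirst(packed, codes);
    return dpbusd_lane(0, codes, qy) - dpbusd_lane(0, ones, qy);
}

/* Direct reference: decode each code to {-1, 0, 1, 2} and multiply. */
static int32_t q2_dot4_ref(uint8_t packed, const int8_t qy[4]) {
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i) {
        int code = (packed >> (2 * i)) & 3;
        sum += (code - 1) * (int32_t)qy[i];
    }
    return sum;
}

/* Exhaustive check over every packed byte for one set of activations. */
static void q2_self_check(void) {
    const int8_t qy[4] = {-128, 127, -3, 5};
    for (int p = 0; p < 256; ++p) {
        assert(q2_dot4_offset((uint8_t)p, qy) == q2_dot4_ref((uint8_t)p, qy));
    }
}
```

The subtraction is needed because the first operand of `dpbusd` is treated as unsigned: the codes 0..3 can be fed in directly, and the constant -1 offset is removed with a second dot product against all-ones.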
What
Adds an AVX-512-VNNI / AVX-512VL fast path for the `Q1_0` and `Q2_0` CPU dot products. Both formats previously had no x86 `vec_dot` path: `arch-fallback` routed the `_generic` functions to a scalar loop. The fast path is guarded by `__AVX512VNNI__ && __AVX512VL__`; the scalar path is preserved as the `#else` fallback.

Why
The low-bit CPU dot product is the hot kernel for Bonsai inference on CPU backends. `Q2_0` in particular was leaving a large amount of x86 throughput on the table.

How
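- Helper `ggml_hsum_i32_8_vnni` to reduce `_mm256_dpbusd_epi32` accumulators
- `Q1_0`: build a sign mask from the bit field, blend `+qy`/`-qy`, accumulate with `dpbusd(ones, sel)`
- `Q2_0`: vectorized 2-bit unpack (replicate-4 + 16-bit shift/mask + pack), then `dpbusd(codes, qy) - dpbusd(ones, qy)` = `sum((code-1)*qy)`

Performance (EPYC 9655, AVX-512-VNNI)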
Scales: the speedup holds at 3.4–4× through `-d 2048`; scalar Q2_0 times out at `-d 8192`.

Correctness
- `test-quantize-fns` dot-product error identical with/without the path.
- `packed_models_KL_validation` flow: Q2_0 mean KLD 0.000135, top-1 99.28%, PPL +0.14%, which passes thresholds. Scalar and VNNI builds produce byte-identical KLD/top-1, confirming the kernel preserves outputs end-to-end.
- `prism` tree (above numbers reproduced here).

Notes
This is the public-fork counterpart of the approved internal PR (`llama.cpp-private#1`), landed on `prism` per review. The one red CI signal, `test-quantize-fns` on `q2_0`, is pre-existing on `prism` (the `{−1,0,1,2}` 2-bit format exceeds the generic 2-bit threshold); it reports identical numbers with and without this change.
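For reviewers who want to sanity-check the Q1_0 step described under How, here is a minimal scalar sketch of the select-then-accumulate idea. It is not the code in this PR. The function names are hypothetical, the polarity (set bit means +1, clear bit means -1) and the low-bit-first ordering are assumptions, and the activations are assumed to lie in [-127, 127] so that negating them stays within `int8_t`.

```c
#include <assert.h>
#include <stdint.h>

/* Select +qy or -qy per weight bit, then sum the selected bytes.
 * Summing `sel` models dpbusd(ones, sel): every product is 1 * sel[i].
 * Assumes qy[i] is in [-127, 127] so that -qy[i] fits in int8_t. */
static int32_t q1_dot8_select(uint8_t bits, const int8_t qy[8]) {
    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) {
        int8_t sel = ((bits >> i) & 1) ? qy[i] : (int8_t)(-qy[i]);
        acc += 1 * (int32_t)sel;
    }
    return acc;
}

/* Direct reference: decode each bit to a weight of +1 or -1 and multiply. */
static int32_t q1_dot8_ref(uint8_t bits, const int8_t qy[8]) {
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i) {
        int w = ((bits >> i) & 1) ? 1 : -1;
        sum += w * (int32_t)qy[i];
    }
    return sum;
}

/* Exhaustive check over all 256 bit patterns for one set of activations. */
static void q1_self_check(void) {
    const int8_t qy[8] = {127, -127, 3, -4, 0, 6, -7, 8};
    for (int b = 0; b < 256; ++b) {
        assert(q1_dot8_select((uint8_t)b, qy) == q1_dot8_ref((uint8_t)b, qy));
    }
}
```

Because the weights are only +1 or -1, the multiply collapses into a per-byte select, which is consistent with the observation above that the scalar Q1_0 loop already auto-vectorizes well.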