RDNA4 Llama Experiments — Squeezing Every Token/s from the R9700 #21043
@JohnTDI-cpu Thanks again! Do you mind sharing Hugging Face links to the two models you used?
All my stats seem to perform better overall. I'm also running this on a custom LACT R9700 profile. Differences I can tell which may have mattered:

Models (GGUF)

System Configuration

Test Configuration / Environment
| Flag | Value |
|---|---|
| -t / threads | 1 |
| -ngl | 99 |
| -fa | 1 (flash attention on) |
| -p | 128,512,2048,8192 (prefill sizes) |
| -n | 128,512,2048 (decode / generation lengths) |
| -r | 3 (repetitions) |
Not set: VK_ICD_FILENAMES
Configs (columns in results)
| Column | Binary / batching |
|---|---|
| Stock RADV | Default llama-bench batching: -b 2048, -ub 512 (no extra flags). ggml-vulkan.cpp: rm_kq = 2 (upstream default). |
| RADV+ub2048 | Same binary as stock; add -ub 2048 and -b 16384. |
| RADV+rm_kq1+ub2048 | Rebuild with uint32_t rm_kq = 1 in ggml/src/ggml-vulkan/ggml-vulkan.cpp (line that defaults to 2); same flags as RADV+ub2048. |
Backend is Vulkan / RADV via the cmake build (GGML_VULKAN=ON).
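A sketch of how the RADV+rm_kq1+ub2048 column could be reproduced, assuming a llama.cpp checkout and that the rm_kq default line matches the pattern below (model path is a placeholder):

```shell
# Flip the compiled-in default from rm_kq = 2 to rm_kq = 1
# (the exact source line is an assumption; verify before rebuilding)
sed -i 's/rm_kq = 2/rm_kq = 1/' ggml/src/ggml-vulkan/ggml-vulkan.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# RADV+rm_kq1+ub2048 flags from the tables above
./build/bin/llama-bench -m model.gguf -t 1 -ngl 99 -fa 1 \
    -b 16384 -ub 2048 -p 128,512,2048,8192 -n 128,512,2048 -r 3
```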
Detailed results: Qwen3.5-35B-A3B (MoE)
Decode
| Test | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| tg128 | 153.4 | 153.1 | 155.6 |
| tg512 | 152.3 | 151.7 | 151.8 |
| tg2048 | 149.7 | 150.4 | 152.3 |
Prefill
| Prompt | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| pp128 | 1839 | 1814 | 1742 |
| pp512 | 3314 | 3265 | 3257 |
| pp2048 | 3272 | 3964 | 3946 |
| pp8192 | 3131 | 3846 | 3832 |
Detailed results: Qwen3.5-27B (Dense)
Decode
| Test | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| tg128 | 32.25 | 32.17 | 32.06 |
| tg512 | 32.27 | 32.21 | 32.06 |
| tg2048 | 32.09 | 32.09 | 31.89 |
Prefill
| Prompt | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| pp128 | 841 | 821 | 830 |
| pp512 | 942 | 914 | 921 |
| pp2048 | 933 | 923 | 930 |
| pp8192 | 883 | 880 | 890 |
build: 48cda24c1 (8555)
Condensed comparison (three RADV configs)
All values are t/s from the detailed tables above. RADV+ub2048 and RADV+rm_kq1+ub2048 use absolute t/s with Δ vs Stock RADV in parentheses (percent, rounded).
| Model | Test | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|---|
| MoE 35B | tg128 | 153.4 | 153.1 (−0.2%) | 155.6 (+1.4%) |
| MoE 35B | pp512 | 3314 | 3265 (−1.5%) | 3257 (−1.7%) |
| MoE 35B | pp2048 | 3272 | 3964 (+21.2%) | 3946 (+20.6%) |
| MoE 35B | pp8192 | 3131 | 3846 (+22.8%) | 3832 (+22.4%) |
| Dense 27B | tg128 | 32.25 | 32.17 (−0.2%) | 32.06 (−0.6%) |
| Dense 27B | pp512 | 942 | 914 (−3.0%) | 921 (−2.2%) |
| Dense 27B | pp2048 | 933 | 923 (−1.1%) | 930 (−0.3%) |
| Dense 27B | pp8192 | 883 | 880 (−0.3%) | 890 (+0.8%) |
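As a sanity check, the Δ values can be recomputed from the absolute t/s numbers, e.g. MoE pp8192 under RADV+ub2048 (3131 → 3846 t/s):

```shell
# Percent delta vs Stock RADV for MoE pp8192
awk 'BEGIN { printf "%+.1f%%\n", (3846 / 3131 - 1) * 100 }'
# prints +22.8%
```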
50+ experiments over several days to find every optimization that matters for llama.cpp Vulkan on RDNA4. All benchmarks were run and verified manually on real hardware. Claude (Anthropic) assisted throughout — helping analyze results, suggest hypotheses for unexpected findings (like the PCIe ASPM discovery), and structure this document. Full results below.
System Configuration
Build commit dc8d14c58 (build 8554), configured with:

cmake -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release

Driver identification

RADV reports: AMD Radeon AI PRO R9700 (RADV GFX1201) (radv)

AMDVLK reports: AMD Radeon AI PRO R9700 (AMD open-source driver)

All benchmarks use explicit VK_ICD_FILENAMES to guarantee driver selection.

Models Tested
Results: Qwen3.5-35B-A3B (MoE, 35B total, ~3.5B active)
Decode
FA ON, 3 reps, values in tokens/s.
Prefill
FA ON, 3 reps, values in tokens/s.
Results: Qwen3.5-27B (Dense, 27B)
Decode
Prefill
RADV vs AMDVLK
RADV wins overall. AMDVLK has a moderate decode advantage on MoE (+3.7%), but RADV's prefill is dramatically faster, especially on dense models where AMDVLK is nearly 4× slower.
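Switching between the two drivers comes down to pointing VK_ICD_FILENAMES at the right ICD manifest. A sketch, assuming typical manifest locations (paths vary by distro, so verify them on your system):

```shell
# RADV (Mesa): commonly installed at this path
export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json
# AMDVLK: likewise an assumed default location
# export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json

# Confirm which ICD is actually active before benchmarking
vulkaninfo --summary | grep -i driver
```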
Optimization Impact (RADV)
-ub 2048

rm_kq=1

rm_kq=1 code change

One line in ggml/src/ggml-vulkan/ggml-vulkan.cpp:

AMDVLK + rm_kq=1 (surprise finding)

rm_kq=1 has a large effect on AMDVLK dense decode (+13%), much more than RADV (+1%). This suggests AMDVLK's LLPC compiler benefits more from reduced register pressure on RDNA4.

Quality & VRAM Verification
Qwen3.5-35B-A3B — WikiText-2 Perplexity
PPL and VRAM identical across all configurations. No quality or memory impact from any optimization.
Reproduction
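A minimal reproduction sketch, assuming the commit and cmake flags quoted in the system configuration (model and dataset paths are placeholders):

```shell
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git checkout dc8d14c58
cmake -B build -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Benchmark with the flags from the test configuration above
./build/bin/llama-bench -m model.gguf -t 1 -ngl 99 -fa 1 \
    -p 128,512,2048,8192 -n 128,512,2048 -r 3

# WikiText-2 perplexity check (dataset path is a placeholder)
./build/bin/llama-perplexity -m model.gguf -f wiki.test.raw -ngl 99
```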
Exhaustive Flag Testing
Qwen3.5-35B-A3B (MoE) — Decode tg128, rm_kq=1 active
RADV experiments
gfx queue has zero effect on RADV 35B MoE decode. Disable fusion catastrophically hurts.
AMDVLK experiments
gfx queue gives +4.7% on AMDVLK 35B MoE. No other flag breaks through 164 t/s.
Qwen3.5-27B (Dense) — Decode tg128, rm_kq=1 active
RADV experiments
Nothing moves RADV 27B decode. 29.3 t/s = hard BW ceiling (15.58 GiB × 29.3 = 456 GB/s = 71% of 640 GB/s).
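The ceiling arithmetic in that note, using the numbers as given (model size in GiB multiplied directly against tokens/s, as the original does), checks out:

```shell
# 15.58 GiB read per token at 29.3 t/s, against a 640 GB/s peak
awk 'BEGIN { bw = 15.58 * 29.3; printf "%.0f GB/s = %.0f%% of 640 GB/s\n", bw, bw / 640 * 100 }'
# prints 456 GB/s = 71% of 640 GB/s
```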
AMDVLK experiments
AMDVLK + rm_kq=1 without gfx = best dense decode (32.73 t/s, +13% over stock rm_kq=2!)
gfx queue HURTS dense AMDVLK by -8% — opposite of MoE where it helps +4.7%.
rm_kq impact across all configs
rm_kq=1 has the largest impact on AMDVLK dense decode (+13%). This suggests AMDVLK's LLPC compiler benefits significantly from reduced VGPR pressure on the RDNA4 wave32 architecture. RADV's ACO compiler handles register allocation differently, gaining less from the same change.

Best Achievable Performance
35B MoE
27B Dense
Dense decode improved by +10.8% on RADV and +14.5% on AMDVLK (vs stock rm_kq=2 + ASPM default) from combined rm_kq=1 + PCIe ASPM performance mode.

Key findings
rm_kq=1 is the single most impactful code change: +1% RADV, +2% AMDVLK MoE, +13% AMDVLK dense.

PCIe ASPM Discovery
Setting PCIe ASPM to performance mode eliminates L1 exit latency:
ASPM L1 power saving adds latency to every PCIe transaction. Dense models suffer most because they read the entire model (~15.6 GB) every token with many small transactions. MoE models batch work more efficiently, hiding PCIe latency.
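The change itself is two commands (requires root; the sysfs path matches the one quoted later in this post, the GRUB mechanics for the permanent variant are an assumption):

```shell
# Check the current ASPM policy, then switch to performance (lasts until reboot)
cat /sys/module/pcie_aspm/parameters/policy
echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy

# Permanent: append pcie_aspm.policy=performance to the kernel command line,
# e.g. via GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config
```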
This is a system-level optimization — no code change, no driver change. Persists until reboot. To make permanent: add pcie_aspm.policy=performance to kernel boot parameters.

Known Issues
Disabling cooperative matrix (GGML_VK_DISABLE_COOPMAT=1) improves AMDVLK dense prefill by +17% (207→243) — suggests AMDVLK's cooperative matrix codegen is suboptimal for dense models. RADV's coopmat works correctly.

Exhaustive Experiment Log (50+ combinations tested)
Parameters with REAL impact
- echo performance > /sys/module/pcie_aspm/parameters/policy
- -ub 2048 -b 16384
- GGML_VK_ALLOW_GRAPHICS_QUEUE=1
- GGML_VK_DISABLE_COOPMAT=1

Parameters with ZERO impact (all tested, all confirmed ±0.3%)
RADV flags: gfx queue (on RADV), RADV_DEBUG=nocompute, RADV_PERFTEST=sam/bolist/localbos/dmashaders/nircache/hic/nogttspill, RADV_PROFILE_PSTATE
llama.cpp env vars: GGML_VK_DISABLE(F16/BF16/COOPMAT2/INTEGER_DOT_PRODUCT/ASYNC/GRAPH_OPTIMIZE), GGML_VK_FORCE_MMVQ, GGML_VK_DISABLE_MMVQ, GGML_VK_DMMV_LARGE, GGML_VK_ENABLE_MEMORY_PRIORITY, GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM, GGML_VK_FORCE_MAX_ALLOCATION_SIZE, GGML_VK_FORCE_MAX_BUFFER_SIZE, GGML_VK_SUBALLOCATION_BLOCK_SIZE (16MB and 1GB)
llama.cpp params: -t 1/4/12 (thread count), --no-host, -nopo (no-op-offload), -dio (direct-io), -mmp 0 (no mmap), -sm row (split mode), -b 1/2/512 (batch size), --prio 2 (scheduling priority), -ctk/-ctv q8_0/q4_0 (KV cache quant)
Code changes: rm_stdq=2, rm_kq_int=2, rm_stdq_int=2, rm_kq=3/4
System tuning: hugepages (16GB), transparent hugepages=always, CPU pinning (taskset), nice -n -20, GPU power profile (COMPUTE/3D_FULL_SCREEN)
DISABLE_FUSION is catastrophic: -18.5% on MoE, -5.1% on dense. Never disable.
Bandwidth utilization analysis
Dense models reach 79-83% BW utilization with ASPM fix. MoE models are lower (56-61%) due to dispatch overhead from expert routing. The remaining 17-20% gap on dense is primarily from:
s_wait_kmcnt waits (per Q4K GEMV shader)

Please share your discoveries too — I'm curious what's the max we can get out of RDNA4.