RDNA4 Llama Experiments — Squeezing Every Token/s from the R9700 #21043
@JohnTDI-cpu Thanks again! Do you mind sharing Hugging Face links to the two models you used?
All my stats seem to perform better overall. I'm also running this on a custom LACT R9700 profile. Differences I can tell which may have mattered:

Models (GGUF)

System Configuration

Test Configuration / Environment
| Flag | Value |
|---|---|
| -t / threads | 1 |
| -ngl | 99 |
| -fa | 1 (flash attention on) |
| -p | 128,512,2048,8192 (prefill sizes) |
| -n | 128,512,2048 (decode / generation lengths) |
| -r | 3 (repetitions) |
Not set: VK_ICD_FILENAMES
Configs (columns in results)
| Column | Binary / batching |
|---|---|
| Stock RADV | Default llama-bench batching: -b 2048, -ub 512 (no extra flags). ggml-vulkan.cpp: rm_kq = 2 (upstream default). |
| RADV+ub2048 | Same binary as stock; add -ub 2048 and -b 16384. |
| RADV+rm_kq1+ub2048 | Rebuild with uint32_t rm_kq = 1 in ggml/src/ggml-vulkan/ggml-vulkan.cpp (line that defaults to 2); same flags as RADV+ub2048. |
Backend is Vulkan / RADV via the cmake build (GGML_VULKAN=ON).
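A sketch of how the RADV+rm_kq1+ub2048 column could be reproduced, assuming a llama.cpp checkout and that the rm_kq default line matches the pattern below (model path is a placeholder):

```shell
# Flip the compiled-in default from rm_kq = 2 to rm_kq = 1
# (the exact source line is an assumption; verify before rebuilding)
sed -i 's/rm_kq = 2/rm_kq = 1/' ggml/src/ggml-vulkan/ggml-vulkan.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# RADV+rm_kq1+ub2048 flags from the tables above
./build/bin/llama-bench -m model.gguf -t 1 -ngl 99 -fa 1 \
    -b 16384 -ub 2048 -p 128,512,2048,8192 -n 128,512,2048 -r 3
```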
Detailed results: Qwen3.5-35B-A3B (MoE)
Decode
| Test | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| tg128 | 153.4 | 153.1 | 155.6 |
| tg512 | 152.3 | 151.7 | 151.8 |
| tg2048 | 149.7 | 150.4 | 152.3 |
Prefill
| Prompt | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| pp128 | 1839 | 1814 | 1742 |
| pp512 | 3314 | 3265 | 3257 |
| pp2048 | 3272 | 3964 | 3946 |
| pp8192 | 3131 | 3846 | 3832 |
Detailed results: Qwen3.5-27B (Dense)
Decode
| Test | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| tg128 | 32.25 | 32.17 | 32.06 |
| tg512 | 32.27 | 32.21 | 32.06 |
| tg2048 | 32.09 | 32.09 | 31.89 |
Prefill
| Prompt | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| pp128 | 841 | 821 | 830 |
| pp512 | 942 | 914 | 921 |
| pp2048 | 933 | 923 | 930 |
| pp8192 | 883 | 880 | 890 |
build: 48cda24c1 (8555)
Condensed comparison (three RADV configs)
All values are t/s from the detailed tables above. RADV+ub2048 and RADV+rm_kq1+ub2048 use absolute t/s with Δ vs Stock RADV in parentheses (percent, rounded).
| Model | Test | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|---|
| MoE 35B | tg128 | 153.4 | 153.1 (−0.2%) | 155.6 (+1.4%) |
| MoE 35B | pp512 | 3314 | 3265 (−1.5%) | 3257 (−1.7%) |
| MoE 35B | pp2048 | 3272 | 3964 (+21.2%) | 3946 (+20.6%) |
| MoE 35B | pp8192 | 3131 | 3846 (+22.8%) | 3832 (+22.4%) |
| Dense 27B | tg128 | 32.25 | 32.17 (−0.2%) | 32.06 (−0.6%) |
| Dense 27B | pp512 | 942 | 914 (−3.0%) | 921 (−2.2%) |
| Dense 27B | pp2048 | 933 | 923 (−1.1%) | 930 (−0.3%) |
| Dense 27B | pp8192 | 883 | 880 (−0.3%) | 890 (+0.8%) |
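As a sanity check, the Δ values can be recomputed from the absolute t/s numbers, e.g. MoE pp8192 under RADV+ub2048 (3131 → 3846 t/s):

```shell
# Percent delta vs Stock RADV for MoE pp8192
awk 'BEGIN { printf "%+.1f%%\n", (3846 / 3131 - 1) * 100 }'
# prints +22.8%
```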
50+ experiments over several days to find every optimization that matters for llama.cpp Vulkan on RDNA4. All benchmarks were run and verified manually on real hardware. Claude (Anthropic) assisted throughout — helping analyze results, suggest hypotheses for unexpected findings (like the PCIe ASPM discovery), and structure this document. Full results below.
System Configuration
Build commit dc8d14c58 (build 8554), configured with:

cmake -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release

Driver identification

RADV reports: AMD Radeon AI PRO R9700 (RADV GFX1201) (radv)

AMDVLK reports: AMD Radeon AI PRO R9700 (AMD open-source driver)

All benchmarks use explicit VK_ICD_FILENAMES to guarantee driver selection.

Models Tested
Results: Qwen3.5-35B-A3B (MoE, 35B total, ~3.5B active)
Decode
FA ON, 3 reps, values in tokens/s.
Prefill
FA ON, 3 reps, values in tokens/s.
Results: Qwen3.5-27B (Dense, 27B)
Decode
Prefill
RADV vs AMDVLK
RADV wins overall. AMDVLK has a moderate decode advantage on MoE (+3.7%), but RADV's prefill is dramatically faster, especially on dense models where AMDVLK is nearly 4× slower.
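Switching between the two drivers comes down to pointing VK_ICD_FILENAMES at the right ICD manifest. A sketch, assuming typical manifest locations (paths vary by distro, so verify them on your system):

```shell
# RADV (Mesa): commonly installed at this path
export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json
# AMDVLK: likewise an assumed default location
# export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json

# Confirm which ICD is actually active before benchmarking
vulkaninfo --summary | grep -i driver
```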
Optimization Impact (RADV)
-ub 2048

rm_kq=1

rm_kq=1 code change

One line in ggml/src/ggml-vulkan/ggml-vulkan.cpp:

AMDVLK + rm_kq=1 (surprise finding)

rm_kq=1 has a large effect on AMDVLK dense decode (+13%), much more than RADV (+1%). This suggests AMDVLK's LLPC compiler benefits more from reduced register pressure on RDNA4.

Quality & VRAM Verification
Qwen3.5-35B-A3B — WikiText-2 Perplexity
PPL and VRAM identical across all configurations. No quality or memory impact from any optimization.
Reproduction
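A minimal reproduction sketch, assuming the commit and cmake flags quoted in the system configuration (model and dataset paths are placeholders):

```shell
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git checkout dc8d14c58
cmake -B build -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Benchmark with the flags from the test configuration above
./build/bin/llama-bench -m model.gguf -t 1 -ngl 99 -fa 1 \
    -p 128,512,2048,8192 -n 128,512,2048 -r 3

# WikiText-2 perplexity check (dataset path is a placeholder)
./build/bin/llama-perplexity -m model.gguf -f wiki.test.raw -ngl 99
```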
Exhaustive Flag Testing
Qwen3.5-35B-A3B (MoE) — Decode tg128, rm_kq=1 active
RADV experiments
gfx queue has zero effect on RADV 35B MoE decode. Disable fusion catastrophically hurts.
AMDVLK experiments
gfx queue gives +4.7% on AMDVLK 35B MoE. No other flag breaks through 164 t/s.
Qwen3.5-27B (Dense) — Decode tg128, rm_kq=1 active
RADV experiments
Nothing moves RADV 27B decode. 29.3 t/s = hard BW ceiling (15.58 GiB × 29.3 = 456 GB/s = 71% of 640 GB/s).
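The ceiling arithmetic in that note, using the numbers as given (model size in GiB multiplied directly against tokens/s, as the original does), checks out:

```shell
# 15.58 GiB read per token at 29.3 t/s, against a 640 GB/s peak
awk 'BEGIN { bw = 15.58 * 29.3; printf "%.0f GB/s = %.0f%% of 640 GB/s\n", bw, bw / 640 * 100 }'
# prints 456 GB/s = 71% of 640 GB/s
```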
AMDVLK experiments
AMDVLK + rm_kq=1 without gfx = best dense decode (32.73 t/s, +13% over stock rm_kq=2!)
gfx queue HURTS dense AMDVLK by -8% — opposite of MoE where it helps +4.7%.
rm_kq impact across all configs
rm_kq=1 has the largest impact on AMDVLK dense decode (+13%). This suggests AMDVLK's LLPC compiler benefits significantly from reduced VGPR pressure on the RDNA4 wave32 architecture. RADV's ACO compiler handles register allocation differently, gaining less from the same change.

Best Achievable Performance
35B MoE
27B Dense
Dense decode improved by +10.8% on RADV and +14.5% on AMDVLK (vs stock rm_kq=2 + ASPM default) from combined rm_kq=1 + PCIe ASPM performance mode.

Key findings
rm_kq=1 is the single most impactful code change: +1% RADV, +2% AMDVLK MoE, +13% AMDVLK dense.

PCIe ASPM Discovery
Setting PCIe ASPM to performance mode eliminates L1 exit latency:
ASPM L1 power saving adds latency to every PCIe transaction. Dense models suffer most because they read the entire model (~15.6 GB) every token with many small transactions. MoE models batch work more efficiently, hiding PCIe latency.
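The change itself is two commands (requires root; the sysfs path matches the one quoted later in this post, the GRUB mechanics for the permanent variant are an assumption):

```shell
# Check the current ASPM policy, then switch to performance (lasts until reboot)
cat /sys/module/pcie_aspm/parameters/policy
echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy

# Permanent: append pcie_aspm.policy=performance to the kernel command line,
# e.g. via GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config
```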
This is a system-level optimization — no code change, no driver change. Persists until reboot. To make permanent: add pcie_aspm.policy=performance to kernel boot parameters.

Known Issues
Disabling cooperative matrix (GGML_VK_DISABLE_COOPMAT=1) improves AMDVLK dense prefill by +17% (207→243) — suggests AMDVLK's cooperative matrix codegen is suboptimal for dense models. RADV's coopmat works correctly.

Exhaustive Experiment Log (50+ combinations tested)
Parameters with REAL impact
- echo performance > /sys/module/pcie_aspm/parameters/policy
- -ub 2048 -b 16384
- GGML_VK_ALLOW_GRAPHICS_QUEUE=1
- GGML_VK_DISABLE_COOPMAT=1

Parameters with ZERO impact (all tested, all confirmed ±0.3%)
RADV flags: gfx queue (on RADV), RADV_DEBUG=nocompute, RADV_PERFTEST=sam/bolist/localbos/dmashaders/nircache/hic/nogttspill, RADV_PROFILE_PSTATE
llama.cpp env vars: GGML_VK_DISABLE(F16/BF16/COOPMAT2/INTEGER_DOT_PRODUCT/ASYNC/GRAPH_OPTIMIZE), GGML_VK_FORCE_MMVQ, GGML_VK_DISABLE_MMVQ, GGML_VK_DMMV_LARGE, GGML_VK_ENABLE_MEMORY_PRIORITY, GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM, GGML_VK_FORCE_MAX_ALLOCATION_SIZE, GGML_VK_FORCE_MAX_BUFFER_SIZE, GGML_VK_SUBALLOCATION_BLOCK_SIZE (16MB and 1GB)
llama.cpp params: -t 1/4/12 (thread count), --no-host, -nopo (no-op-offload), -dio (direct-io), -mmp 0 (no mmap), -sm row (split mode), -b 1/2/512 (batch size), --prio 2 (scheduling priority), -ctk/-ctv q8_0/q4_0 (KV cache quant)
Code changes: rm_stdq=2, rm_kq_int=2, rm_stdq_int=2, rm_kq=3/4
System tuning: hugepages (16GB), transparent hugepages=always, CPU pinning (taskset), nice -n -20, GPU power profile (COMPUTE/3D_FULL_SCREEN)
DISABLE_FUSION is catastrophic: -18.5% on MoE, -5.1% on dense. Never disable.
Bandwidth utilization analysis
Dense models reach 79-83% BW utilization with ASPM fix. MoE models are lower (56-61%) due to dispatch overhead from expert routing. The remaining 17-20% gap on dense is primarily from:
s_wait_kmcnt waits (per Q4K GEMV shader)

Please share your discoveries too — I'm curious what's the max we can get out of RDNA4.