Based on llama.cpp build 7924.
See SCRIPT_llama_bench.sh for llama-bench configuration and SCRIPT_launch_server_MI50.sh for server launch settings.
The core modifications are implemented in the ggml-cuda/gfx906 folder:
ggml/src/ggml-cuda/gfx906/
├── gfx906-common.cuh - DPP warp reductions & common utilities
├── gfx906-config.h - Feature toggles
├── attention/
│   ├── fattn-q8.cuh - Q8 FlashAttention kernel
│   ├── fattn-q8.cu - Instance launcher
│   ├── rope.cuh - Optimized RoPE kernel
│   └── instances/ - Template instantiations for various head dims
├── fused/
│   ├── gather-q8.cuh - Q8 gather helpers
│   ├── gather-q8.cu - Q8 gather kernel
│   ├── graph-fusion.cuh - Graph fusion logic
│   ├── mmq-prequantized.cuh - Prequantized MMQ helpers
│   ├── norm-fused-q8.cuh - Fused norm dispatch
│   └── norm-fused-q8.cu - Fused norm kernels
├── matmul/
│   ├── mmf.cuh - MMF (mul-mat-fused) helpers
│   ├── mmq.cuh - MMQ vectorized loads
│   ├── mmq-prefetch.cuh - Prefetch helpers
│   ├── mmvq-q4_0.cuh - Warp-cooperative MMVQ Q4_0
│   ├── mmvq-q4_1.cuh - Warp-cooperative MMVQ Q4_1
│   ├── mmvq-q8_0.cuh - Warp-cooperative MMVQ Q8_0
│   └── sgemm.cuh - SGEMM helpers
└── quantize/
    ├── epilogue.cuh - DPP-based Q8_1 epilogue
    ├── q8-cache.cuh - Q8 cross-op cache
    └── vecdotq.cuh - MXFP4 vectorized loads
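The MXFP4 load path above relies on converting e8m0 block scales to floats. As a minimal host-side sketch (illustrative only, not the repo's actual code): an e8m0 byte encodes the power-of-two scale 2^(e - 127), so writing the byte straight into the float exponent field performs the conversion with a single shift instead of a call to exp2.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative e8m0 -> float conversion. An e8m0 byte e encodes the
// scale 2^(e - 127); placing e in the IEEE-754 exponent field (bits
// 23..30, with sign and mantissa zero) yields exactly that value.
// Valid for 1 <= e <= 254; e == 0 would need subnormal handling and
// e == 255 encodes NaN in the MX spec.
static inline float e8m0_to_float(uint8_t e) {
    uint32_t bits = (uint32_t)e << 23; // exponent-only float
    float f;
    std::memcpy(&f, &bits, sizeof(f)); // aliasing-safe bit cast
    return f;
}
```

The same shift-into-exponent trick maps directly onto a GPU integer pipeline, which is the kind of shortcut the optimized load path exploits.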
Key modifications by file:
gfx906-mmvq-q4_0.cuh - Warp-cooperative Q4_0 MMVQ kernel
gfx906-mmvq-q4_1.cuh - Warp-cooperative Q4_1 MMVQ kernel
gfx906-mmvq-q8_0.cuh - Warp-cooperative Q8_0 MMVQ kernel
mmvq.cu - Half-warp (32-thread) dispatch for small MoE matrices
mmq.cuh - Software pipelining for Q8_0 MMQ loads
mmq.cuh - Optimized Q8 MMQ need_check path to avoid LDS bank conflicts
mmq.cuh - MXFP4 load pipeline with e8m0 conversion optimization
vecdotq.cuh - Fast Q8_0 load path using memcpy
vecdotq.cuh - Software-pipelined MXFP4 MMVQ for v_perm latency hiding
vecdotq.cuh - MXFP4 lookup using two v_perm ops plus an arithmetic sign
mmq.cu/mmid.cu - MoE sub-warp shuffle fix for 64-wide wavefronts (fixes gpt-oss loading problems)
common.cuh - DPP-based warp reductions with unified shuffle-XOR dispatch
fattn-common.cuh - GCN-optimized thread counts and tile configurations
fattn.cu - Q8-optimized tile-kernel selection for GFX906 flash attention
mmq.cu - Integrated GFX906 vectorized loads for Q4_0/Q4_1 quantizations
gfx906/ - New directory with MI50/MI60-specific kernel implementations
Optional but sometimes required: set the paths for ROCm and the device libs if they are not in /opt/rocm/.
export ROCM_PATH=/opt/rocm-7.1.0 # optional
export HIP_DEVICE_LIB_PATH=/opt/rocm-7.1.0/amdgcn/bitcode # optional
git clone https://github.com/iacopPBK/llama.cpp-gfx906.git
cd llama.cpp-gfx906
./SCRIPT_compile_MI50.sh # edit ROCM_PATH if not using /opt/rocm
./SCRIPT_launch_server_MI50.sh # edit MODEL_PATH to your model file
./SCRIPT_llama_bench.sh # edit MODEL_PATH to your model file, performs the bench shown above
Tested with ROCm 7.1.1 on GFX906 GPUs (MI50/MI60).
Performance scales with the power limit; SCRIPT_overclock_upp_MI50.sh overclocks the MI50 via UPP (PowerPlay Table Editor). Results were gathered using the 2511 release.
Props to these users for the time they've put into the repo:
@fuutott ・ @mircoboschi ・ @skyne98 ・ @kamali-lab
AMD GCN ISA ・ llama.cpp ・ ROCm ・ GFX906 DISCORD ・ wiki-gfx906 ・ llama-labs-gfx906
Built for the GFX906 community