Optimize: EP4 intranode kernel for FP4 dispatch + FP8 combine by jhchouuu · Pull Request #170 · ROCm/mori

jhchouuu · 2026-02-17T05:44:53Z

Motivation

Reduce EP4 intranode dispatch + combine latency on the FP4 dispatch + BF16→FP8 combine path through kernel-level optimizations and launch configuration tuning.

Technical Details

Block-level grid barrier: Replace per-warp atomicAdd with __syncthreads() + per-block atomicAdd in both dispatch and combine kernels, reducing atomic contention by ~4x.
Combine source pointer compaction: Use __ballot + __popcll to compact non-null source pointers before accumulation. EP4 deduplication leaves ~50% null pointers; compaction halves the accumulation work. EP8 skips compaction automatically (nearly all pointers valid, ~3 cycle overhead).
EP4 launch config tuning: Sweep block_num × warp_per_block space for 64/128/256 tokens and update EP4 defaults in bench_dispatch_combine.py.
Tuning script: Add intranode_tuning.sh.
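
The source pointer compaction described above can be sketched on the host. This is a minimal emulation and not the kernel code from this PR: it assumes one source pointer per lane of a 64-lane wavefront, stands in for `__ballot` with a loop that builds the mask, and uses the GCC/Clang builtin `__builtin_popcountll` in place of `__popcll`. The names `Compacted`, `ballot_non_null`, `compact_sources`, and `accumulate` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption: one 64-lane wavefront, one candidate source pointer per lane.
constexpr int kWarpSize = 64;

struct Compacted {
    std::array<const float*, kWarpSize> ptrs{};  // non-null pointers packed to the front
    int count = 0;                               // number of valid pointers
};

// Host stand-in for __ballot(ptr != nullptr): bit i is set when lane i is valid.
inline std::uint64_t ballot_non_null(const std::array<const float*, kWarpSize>& src) {
    std::uint64_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (src[lane] != nullptr) mask |= (std::uint64_t{1} << lane);
    }
    return mask;
}

// Pack the non-null pointers densely so the accumulation loop never visits a null slot.
inline Compacted compact_sources(const std::array<const float*, kWarpSize>& src) {
    Compacted out;
    const std::uint64_t mask = ballot_non_null(src);
    out.count = __builtin_popcountll(mask);  // stands in for __popcll(mask)
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (src[lane] == nullptr) continue;
        // Destination slot = number of valid lanes with a lower lane id.
        const std::uint64_t below = mask & ((std::uint64_t{1} << lane) - 1);
        out.ptrs[__builtin_popcountll(below)] = src[lane];
    }
    return out;
}

// Sum element idx over the compacted sources only.
inline float accumulate(const Compacted& c, std::size_t idx) {
    float sum = 0.0f;
    for (int i = 0; i < c.count; ++i) sum += c.ptrs[i][idx];
    return sum;
}
```

With roughly half of the slots null, as reported for EP4, the loop in `accumulate` runs about half as many iterations as a loop over every slot. When nearly every slot is valid, as in EP8, `count` is close to the slot count and compaction saves little.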

Test Result

EP4 Dispatch + Combine total latency (FP4 dispatch + FP8 combine, P2P write, MI355X 4-GPU):

Tokens/Rank	Before	After	Speedup
64	72 µs	44 µs	1.64x
128	102 µs	49 µs	2.08x
256	178 µs	80 µs	2.23x

Correctness was verified with the bench warmup check, and a 100k-round stress test was run across EP4 FP4+FP8, EP4 BF16 zero-copy, EP8 FP4+FP8, and EP8 BF16; all pass.

jhchouuu added 3 commits February 16, 2026 15:42

refine intranode benchmark

80908f4

Optimize: optimize EP4 intranode && add tuning script

851baea

Fix: fix EP8 benchmark perf downgrade

c5b9769

jhchouuu merged commit e3cab4b into main Feb 17, 2026
1 check failed

jhchouuu merged 3 commits into main from jiahzhou/EP4_optimize