
Optimize: EP4 intranode kernel for FP4 dispatch + FP8 combine #170

Merged
jhchouuu merged 3 commits into main from jiahzhou/EP4_optimize on Feb 17, 2026
Conversation

@jhchouuu
Collaborator

Motivation

Optimize EP4 intranode dispatch+combine latency for FP4 dispatch + BF16→FP8 combine through kernel-level optimizations and launch configuration tuning.

Technical Details

  • Block-level grid barrier: Replace per-warp atomicAdd with __syncthreads() + per-block atomicAdd in both the dispatch and combine kernels, reducing atomic contention by ~4x (see the first sketch after this list).
  • Combine source pointer compaction: Use __ballot + __popcll to compact non-null source pointers before accumulation (see the second sketch after this list). EP4 deduplication leaves ~50% null pointers; compaction halves the accumulation work. EP8 skips compaction automatically (nearly all pointers are valid; ~3-cycle overhead).
  • EP4 launch config tuning: Sweep block_num × warp_per_block space for 64/128/256 tokens and update EP4 defaults in bench_dispatch_combine.py.
  • Tuning script: Add intranode_tuning.sh.
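
For illustration, a minimal HIP-style sketch of the block-level arrival pattern described in the first bullet. The function and counter names are assumptions (not the actual kernel code), and counter reset/phasing for barrier reuse is omitted:

```cpp
// Minimal sketch (names assumed). One arrival per block instead of one per warp:
// all warps reach __syncthreads(), then a single thread publishes the block's
// arrival with one atomicAdd and spins until every block has checked in.
__device__ void block_arrive_and_wait(int* grid_counter, int num_blocks) {
    __syncthreads();                        // every warp in this block has arrived
    if (threadIdx.x == 0) {
        __threadfence();                    // make prior global writes visible
        atomicAdd(grid_counter, 1);         // one atomic per block (vs. one per warp)
        // Wait for the other blocks. Resetting the counter for the next
        // barrier phase is omitted here for brevity.
        while (*(volatile int*)grid_counter < num_blocks) { }
    }
    __syncthreads();                        // release the rest of the block
}
```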

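And a sketch of the combine-side pointer compaction, assuming a 64-lane wavefront (MI355X) with one wavefront handling the source list; the argument names, array sizes, and the plain float accumulation are illustrative stand-ins for the actual FP8 combine path:

```cpp
// Minimal sketch (names and types assumed): compact non-null source pointers with
// __ballot/__popcll so the accumulation loop only visits valid sources.
__device__ void combine_compacted(const float* src_ptrs[], int num_srcs,
                                  float* dst, int hidden_dim) {
    const int lane = threadIdx.x & 63;      // assume 64-lane wavefront, 1 wave/block
    const float* my_src = (lane < num_srcs) ? src_ptrs[lane] : nullptr;

    // Bitmask of lanes holding a non-null pointer, and how many there are.
    unsigned long long valid = __ballot(my_src != nullptr);
    int num_valid = __popcll(valid);

    // Compacted slot index = number of valid lanes below this lane.
    int slot = __popcll(valid & ((1ull << lane) - 1ull));

    __shared__ const float* compact[64];
    if (my_src != nullptr) compact[slot] = my_src;
    __syncthreads();

    // Accumulate over the compacted list only (~half the iterations for EP4).
    for (int d = threadIdx.x; d < hidden_dim; d += blockDim.x) {
        float acc = 0.0f;
        for (int s = 0; s < num_valid; ++s) acc += compact[s][d];
        dst[d] = acc;
    }
}
```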
Test Result

EP4 Dispatch + Combine total latency (FP4 dispatch + FP8 combine, P2P write, MI355X 4-GPU):

| Tokens/Rank | Before | After | Speedup |
| --- | --- | --- | --- |
| 64 | 72 µs | 44 µs | 1.64x |
| 128 | 102 µs | 49 µs | 2.08x |
| 256 | 178 µs | 80 µs | 2.23x |

Correctness was verified via the bench warmup check, and 100k-round stress tests across EP4 FP4+FP8, EP4 BF16 zero-copy, EP8 FP4+FP8, and EP8 BF16 all pass.

@jhchouuu merged commit e3cab4b into main on Feb 17, 2026
1 check failed