This issue tracks a series of 3 pull requests targeting ROCm/aiter.
Status: PRs are being prepared; a full description will be added shortly.
- PR 1: [Perf][Kernel] Add decode buffer caches to eliminate per-step HIP malloc in fused_moe
- PR 2: [Perf][Kernel] Add gfx950 1-stage ASM fast path for FP8 blockscale decode (ntok<=512)
- PR 3: [Kernel][Perf] Add MiniMax-M2.5 GEMM and FMoE tuning configs for gfx950