Auto-tune BiDir AllGather threshold per GPU architecture by saifhhasan · Pull Request #946 · meta-pytorch/torchcomms

saifhhasan · 2026-03-05T02:08:49Z

Summary:
Change the default of NCCL_CTRAN_ALLREDUCE_RING_BIDIR_AG_MAX_SIZE from a
hardcoded 128MB to -2 (auto-tune), which selects a per-GPU-architecture
threshold at runtime:

- GB200 (Blackwell, SM >= 10): 128MB
- H100 (Hopper, SM < 10): 4MB (conservative)
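
The per-architecture selection above can be sketched as a small lookup keyed on the SM major version. This is a minimal illustration, not the CTRAN code: the function name is hypothetical, and it assumes "MB" means 2**20 bytes.

```python
MIB = 1024 * 1024  # assumption: "MB" in this PR means 2**20 bytes


def auto_tuned_bidir_ag_max_size(sm_major: int) -> int:
    """Return the BiDir AllGather size threshold in bytes for a GPU.

    Hypothetical helper; the name and signature are illustrative only.
    """
    if sm_major >= 10:
        # Blackwell-class GPUs such as GB200
        return 128 * MIB
    # Hopper-class GPUs such as H100 (SM 9.x): conservative value
    return 4 * MIB
```

In a PyTorch process, the SM major version is the first element of `torch.cuda.get_device_capability()`.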

BiDir AllGather sends data in both directions during the AllGather phase
of AllReduceRing, reducing total steps. It benefits small-to-medium
messages where ring latency dominates, but hurts large messages where
the extra coordination overhead outweighs the reduced step count.
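
To make the step-count reduction concrete, here is a toy model of a ring AllGather in which each rank forwards one chunk per direction per step. It is a sketch of the general technique under that assumption, not the CTRAN implementation, and it ignores bandwidth and coordination cost entirely.

```python
def ring_allgather_steps(num_ranks: int, bidirectional: bool) -> int:
    """Count steps until every rank holds every chunk in a toy ring model."""
    have = [{r} for r in range(num_ranks)]  # chunks held by each rank
    fwd = list(range(num_ranks))  # chunk each rank last received clockwise
    bwd = list(range(num_ranks))  # chunk each rank last received counter-clockwise
    steps = 0
    while any(len(h) < num_ranks for h in have):
        # Each rank receives from its left neighbor the chunk that neighbor
        # obtained in the previous step (its own chunk on the first step).
        fwd = [fwd[(r - 1) % num_ranks] for r in range(num_ranks)]
        for r in range(num_ranks):
            have[r].add(fwd[r])
        if bidirectional:
            # Same relay in the opposite direction, from the right neighbor.
            bwd = [bwd[(r + 1) % num_ranks] for r in range(num_ranks)]
            for r in range(num_ranks):
                have[r].add(bwd[r])
        steps += 1
    return steps
```

In this model a unidirectional ring of N ranks needs N-1 steps, while the bidirectional variant needs about half as many, which is the latency saving the description refers to.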

The optimal crossover depends on the platform's bandwidth-delay product
(BDP), which varies by GPU architecture:
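
For reference, the bandwidth-delay product is link bandwidth multiplied by round-trip time. The helper below is a hypothetical sketch of that arithmetic; the numbers in the usage note are illustrative and are not measurements from this PR.

```python
def bdp_bytes(bandwidth_gbps: float, rtt_us: float) -> float:
    """Bandwidth-delay product in bytes.

    bandwidth_gbps: link bandwidth in gigabits per second (1e9 bits/s)
    rtt_us: round-trip time in microseconds
    """
    # Gb/s * us = 1e9 bits/s * 1e-6 s = 1e3 bits; divide by 8 for bytes.
    return bandwidth_gbps * rtt_us * 1000 / 8
```

For example, an assumed 400 Gb/s link with an assumed 5 us round trip gives `bdp_bytes(400, 5)`, i.e. 250000 bytes in flight.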

GB200 benchmarks (aarch64, IB-only, ppn=1, 8/16/32/64 nodes):

- BiDir consistently outperforms NoBidir for messages up to 64-128MB
- 1M-64M: +10-28% busBW improvement
- 128M+: marginal or mixed results

H100 benchmarks (x86_64, IB-only, ppn=1, 8/16/32/64 nodes):

- BiDir wins at smaller sizes; the crossover scales with node count:
  - 8N: BiDir wins up to 4MB (+6.1%)
  - 16N: BiDir wins up to 8MB (+10.3%)
  - 32N: BiDir wins up to 16MB (+12.8%)
  - 64N: BiDir wins up to 32MB (+16.5%)
- Conservative threshold: 4MB (safe across all node counts)
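
The conservative value follows from taking the smallest observed crossover. The sketch below restates the H100 crossovers listed above (assuming "MB" means 2**20 bytes) and also shows that they work out to a constant size per node; the helper names are hypothetical, and a node-count-aware threshold is not something this PR implements.

```python
MIB = 1024 * 1024  # assumption: "MB" in this PR means 2**20 bytes

# H100 crossover sizes from the benchmark summary: node count -> bytes
H100_CROSSOVER_BYTES = {8: 4 * MIB, 16: 8 * MIB, 32: 16 * MIB, 64: 32 * MIB}


def conservative_threshold(crossovers: dict) -> int:
    """Largest size at which BiDir wins for every node count measured."""
    return min(crossovers.values())


def crossover_bytes_per_node(crossovers: dict) -> dict:
    """Crossover size divided by node count, to expose the scaling."""
    return {nodes: size // nodes for nodes, size in crossovers.items()}
```

`conservative_threshold(H100_CROSSOVER_BYTES)` returns 4MB, and every entry of `crossover_bytes_per_node(H100_CROSSOVER_BYTES)` is 512KB, so the 4MB default leaves performance on the table at larger node counts in exchange for never regressing at 8 nodes.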

MCCL Auto-Tuned vs NCCL baseline (H100, sizes >= 512KB):

Nodes	Avg % Diff	Min % Diff	Max % Diff
8N	+0.5%	-0.3%	+3.6%
16N	-0.9%	-5.2%	+6.9%
32N	+0.9%	-4.8%	+19.4%
64N	+3.1%	-0.5%	+17.0%

CVAR semantics updated:

- 0: disabled
- -1: enabled for all sizes
- -2: auto-tune per GPU architecture (new default)
- >0: explicit threshold in bytes
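
The four cases can be read as the following decision function. This is a hedged sketch: the function and constant names are hypothetical, the auto-tuned threshold is passed in by the caller, and the inclusive `<=` comparison is an assumption based on the "MAX_SIZE" naming rather than something the description states.

```python
DISABLED = 0
ALL_SIZES = -1
AUTO_TUNE = -2


def use_bidir_allgather(msg_bytes: int, cvar_value: int,
                        auto_threshold_bytes: int) -> bool:
    """Decide whether BiDir AllGather applies to a message of msg_bytes."""
    if cvar_value == DISABLED:
        return False
    if cvar_value == ALL_SIZES:
        return True
    if cvar_value == AUTO_TUNE:
        threshold = auto_threshold_bytes  # chosen per GPU architecture
    elif cvar_value > 0:
        threshold = cvar_value  # explicit threshold in bytes
    else:
        raise ValueError(f"unsupported CVAR value: {cvar_value}")
    # Assumption: the threshold is inclusive.
    return msg_bytes <= threshold
```

Under these semantics, setting `NCCL_CTRAN_ALLREDUCE_RING_BIDIR_AG_MAX_SIZE=0` turns the feature off, and any positive value overrides the auto-tuned threshold.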

Differential Revision: D94867201


meta-codesync · 2026-03-05T02:08:57Z

@saifhhasan has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94867201.

meta-cla bot added the CLA Signed label Mar 5, 2026

meta-codesync bot added fb-exported meta-exported labels Mar 5, 2026

saifhhasan wants to merge 1 commit into meta-pytorch:main from saifhhasan:export-D94867201
