Auto-tune BiDir AllGather threshold per GPU architecture #946

Open
saifhhasan wants to merge 1 commit into meta-pytorch:main from saifhhasan:export-D94867201

Conversation

@saifhhasan

Summary:
Change the default of NCCL_CTRAN_ALLREDUCE_RING_BIDIR_AG_MAX_SIZE from a
hardcoded 128MB to -2 (auto-tune), which selects a per-GPU-architecture
threshold at runtime:
- GB200 (Blackwell, SM >= 10): 128MB
- H100 (Hopper, SM < 10): 4MB (conservative)

BiDir AllGather sends data in both directions during the AllGather phase
of AllReduceRing, reducing total steps. It benefits small-to-medium
messages where ring latency dominates, but hurts large messages where
the extra coordination overhead outweighs the reduced step count.

The optimal crossover depends on the platform's bandwidth-delay product
(BDP), which varies by GPU architecture:

**GB200 benchmarks** (aarch64, IB-only, ppn=1, 8/16/32/64 nodes):
- BiDir consistently outperforms NoBidir for messages up to 64-128MB
- 1M-64M: +10-28% busBW improvement
- 128M+: marginal or mixed results

**H100 benchmarks** (x86_64, IB-only, ppn=1, 8/16/32/64 nodes):
- BiDir wins at smaller sizes, crossover scales with node count:
  - 8N:  BiDir wins up to 4MB (+6.1%)
  - 16N: BiDir wins up to 8MB (+10.3%)
  - 32N: BiDir wins up to 16MB (+12.8%)
  - 64N: BiDir wins up to 32MB (+16.5%)
- Conservative threshold: 4MB (safe across all node counts)

**MCCL Auto-Tuned vs NCCL baseline (H100, sizes >= 512KB):**
| Nodes | Avg % Diff | Min % Diff | Max % Diff |
|-------|-----------|-----------|-----------|
| 8N    | +0.5%     | -0.3%     | +3.6%     |
| 16N   | -0.9%     | -5.2%     | +6.9%     |
| 32N   | +0.9%     | -4.8%     | +19.4%    |
| 64N   | +3.1%     | -0.5%     | +17.0%    |

CVAR semantics updated:
- 0: disabled
- -1: enabled for all sizes
- -2: auto-tune per GPU architecture (new default)
- >0: explicit threshold in bytes

Differential Revision: D94867201
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Mar 5, 2026

meta-codesync bot commented Mar 5, 2026

@saifhhasan has exported this pull request. If you are a Meta employee, you can view the originating Diff in D94867201.
