Reland support premul sum for xccl #3173

Open
Chao1Han wants to merge 3 commits into main from xccl/reland

Conversation

Contributor

@Chao1Han Chao1Han commented Mar 25, 2026

Reland #1948

disable_e2e
disable_ut

Copilot AI review requested due to automatic review settings March 25, 2026 02:26
Contributor

Copilot AI left a comment


Pull request overview

This PR reintroduces XCCL support for ReduceOp::PREMUL_SUM when building against oneCCL >= 2021.17 by adding version gating, reduction-op construction helpers, and RAII cleanup for custom reduction handles.

Changes:

  • Add compile-time ENABLE_XCCL_PREMUL_SUM_SUPPORT based on oneCCL version macros.
  • Introduce RAII wrappers and unpack helpers to create/destroy PREMUL_SUM reduction ops for both CCL “V1” and oneCCL “V2” APIs.
  • Update XCCL collective call sites to pass datatype/communicator needed to build PREMUL_SUM ops.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/xccl/xccl.h | Adds PREMUL_SUM enablement, the RAII reduction wrapper, and the PREMUL_SUM mapping logic in getXcclReduceOpV1/V2. |
| src/xccl/xccl.cpp | Updates the allreduce/reduce/reduce_scatter wrappers to pass the datatype and communicator into the reduce-op selection helpers. |


@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%): likely a regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | Background_Matting | 0.744062 | 0.682433 |
| torchbench_bfloat16_training | pytorch_unet | 0.765713 | 0.725342 |
| torchbench_bfloat16_training | alexnet | 0.772230 | 0.754138 |
| torchbench_bfloat16_training | resnet50 | 0.755595 | 0.764938 |
| torchbench_bfloat16_training | nvidia_deeprecommender | 0.750006 | 0.770794 |
| torchbench_bfloat16_training | shufflenet_v2_x1_0 | 0.783204 | 0.858808 |
| torchbench_bfloat16_training | LearningToPaint | 0.686767 | 0.862129 |
| torchbench_bfloat16_training | vgg16 | 0.752418 | 0.864969 |
  • 🟡 [80%, 90%): may be fluctuation

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | mobilenet_v2 | 0.843263 | 0.828889 |
| torchbench_bfloat16_training | resnet18 | 0.850908 | 0.839607 |
| torchbench_bfloat16_training | BERT_pytorch | 1.011453 | 0.895164 |
| torchbench_bfloat16_training | mnasnet1_0 | 0.879395 | 0.946367 |
| torchbench_bfloat16_training | squeezenet1_1 | 0.821079 | 0.946886 |
| torchbench_bfloat16_training | resnext50_32x4d | 0.895986 | 1.002764 |
