
Add ProcessGroupXCCL functionality to pass test_c10d_xccl.py #3171

Open
frost-intel wants to merge 4 commits into main from frost/test_c10d_xccl_fixes

Conversation

@frost-intel
Contributor

Recently enabled tests at https://github.com/daisyden/pytorch/blob/distributed_2.11/test/distributed/test_c10d_xccl.py show a few gaps in functionality for ProcessGroupXCCL.

Fixes #2967

Copilot AI review requested due to automatic review settings March 24, 2026 19:34

Copilot AI left a comment


Pull request overview

This PR fills in missing ProcessGroupXCCL backend capabilities needed to satisfy gaps exposed by test/distributed/test_c10d_xccl.py, mainly around initialization hooks, timeout configuration, and FlightRecorder trace/timing support.

Changes:

  • Add setTimeout() override, isInitialized(), and make setEnableNanCheck() publicly accessible on ProcessGroupXCCL.
  • Add reset_xccl_trace() to reset XCCL FlightRecorder state.
  • Enable duration computation for recorded collectives/P2P ops by switching FlightRecorderXCCL::retire_id(..., compute_duration) to true.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/xccl/ProcessGroupXCCL.hpp | Exposes new/previously-nonpublic ProcessGroupXCCL APIs (setTimeout, isInitialized, setEnableNanCheck) and declares reset_xccl_trace(). |
| src/xccl/ProcessGroupXCCL.cpp | Implements reset_xccl_trace() and isInitialized(), updates timeout error throwing, and enables FlightRecorder duration computation on retirement callbacks. |


Comment on lines +372 to +378

  if (devXCCLCommMap_.empty()) {
    return false;
  }
  // Unlike PGNCCL, all comms are initialized (or we fail)
  return true;
}


Copilot AI Mar 24, 2026


isInitialized() currently reports initialized only after at least one communicator is created (devXCCLCommMap_ non-empty). Since communicators are created lazily (e.g., first collective calls initXCCLComm()), this will return false immediately after ProcessGroupXCCL construction, which is likely not what Python is_initialized()/backend readiness checks expect. Consider tracking an explicit initialized flag (set after constructor/init completes) and/or renaming the method to reflect what it actually checks; also update the comment about comms being initialized up-front (it contradicts the current lazy-init behavior).

Suggested change
-  if (devXCCLCommMap_.empty()) {
-    return false;
-  }
-  // Unlike PGNCCL, all comms are initialized (or we fail)
-  return true;
-}
+  // ProcessGroupXCCL is considered initialized once construction succeeds.
+  // Communicators are created lazily (e.g., on first collective), so
+  // initialization is not tied to devXCCLCommMap_ being non-empty.
+  return true;
+}

Copilot uses AI. Check for mistakes.
    options_->timeout = timeout;
  }

  bool isInitialized();

Copilot AI Mar 24, 2026


isInitialized() is added without override (and without const). If this is intended to implement/override the Backend::isInitialized() virtual used by Python bindings, please match the base signature and mark it override so the compiler can catch mismatches (otherwise the base implementation may still be used).

Suggested change
-  bool isInitialized();
+  bool isInitialized() const override;

Comment on lines 770 to 774
 work->future_->addCallback(
     [id, reset_epoch](at::ivalue::Future&) {
       FlightRecorderXCCL::get()->retire_id(
-          id, reset_epoch, /*compute_duration*/ false);
+          id, reset_epoch, /*compute_duration*/ true);
     },

Copilot AI Mar 24, 2026


retire_id(..., /*compute_duration*/ true) is now unconditional. When timing is disabled for a WorkXCCL (xcclStartEvent_ is nullptr), this may add avoidable overhead and can lead to inconsistent trace contents (e.g., duration_ms appearing even when timing is off, or duration computation with missing start events depending on FlightRecorder behavior). Consider making compute_duration conditional on the work/timing setting (e.g., capture work->timingEnabled_ or equivalent) so duration is computed only when timing is actually enabled.

Comment on lines 925 to 929
 work->future_->addCallback(
     [id, reset_epoch](at::ivalue::Future&) {
       FlightRecorderXCCL::get()->retire_id(
-          id, reset_epoch, /*compute_duration*/ false);
+          id, reset_epoch, /*compute_duration*/ true);
     },

Copilot AI Mar 24, 2026


Same as above for P2P: retire_id(..., /*compute_duration*/ true) is unconditional and should ideally be conditional on timing being enabled for this work to avoid overhead and trace/schema changes when timing is disabled.

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | Background_Matting | 0.763132 | 0.699989 |
| torchbench_bfloat16_training | alexnet | 0.784896 | 0.754091 |
| torchbench_bfloat16_training | pytorch_unet | 0.808368 | 0.755619 |
| torchbench_bfloat16_training | nvidia_deeprecommender | 0.758746 | 0.761161 |
| torchbench_bfloat16_training | resnet50 | 0.772198 | 0.767220 |
| torchbench_bfloat16_training | vgg16 | 0.771852 | 0.882418 |
| torchbench_bfloat16_training | shufflenet_v2_x1_0 | 0.794900 | 0.889429 |
| torchbench_bfloat16_training | LearningToPaint | 0.714729 | 0.947847 |

  • 🟡 [80%, 90%), may be fluctuations

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| torchbench_bfloat16_training | mobilenet_v2 | 0.843230 | 0.839124 |
| torchbench_bfloat16_training | squeezenet1_1 | 0.831182 | 1.032032 |

@github-actions

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| huggingface_bfloat16_training | RobertaForCausalLM | 0.754173 | 0.720551 |
| huggingface_float16_training | DistilBertForMaskedLM | 0.742165 | 0.736366 |
| huggingface_float16_training | RobertaForCausalLM | 0.724203 | 0.736471 |
| huggingface_float16_training | MBartForCausalLM | 0.775283 | 0.737618 |
| huggingface_float16_training | PegasusForCausalLM | 0.729126 | 0.739225 |
| huggingface_float16_training | TrOCRForCausalLM | 0.704423 | 0.739865 |
| huggingface_float16_training | PLBartForCausalLM | 0.746446 | 0.745405 |
| huggingface_float16_training | BartForCausalLM | 0.772044 | 0.746323 |
| huggingface_bfloat16_training | DistilBertForMaskedLM | 0.780531 | 0.746984 |
| huggingface_bfloat16_training | MBartForCausalLM | 0.799427 | 0.753490 |
| huggingface_bfloat16_training | PLBartForCausalLM | 0.782411 | 0.758965 |
| huggingface_bfloat16_training | TrOCRForCausalLM | 0.767763 | 0.760552 |
| huggingface_bfloat16_training | BertForMaskedLM | 0.774522 | 0.763210 |
| huggingface_bfloat16_training | XLNetLMHeadModel | 0.737075 | 0.764225 |
| huggingface_float16_training | DistillGPT2 | 0.719704 | 0.765206 |
| huggingface_float16_training | T5ForConditionalGeneration | 0.759213 | 0.766725 |
| huggingface_bfloat16_training | BartForCausalLM | 0.827395 | 0.768687 |
| huggingface_bfloat16_training | DistillGPT2 | 0.773987 | 0.768755 |
| huggingface_float16_training | T5Small | 0.758678 | 0.773184 |
| huggingface_bfloat16_training | PegasusForCausalLM | 0.777517 | 0.773254 |
| huggingface_float16_training | YituTechConvBert | 0.716842 | 0.775056 |
| huggingface_float16_training | BertForMaskedLM | 0.747784 | 0.777874 |
| huggingface_float16_training | XLNetLMHeadModel | 0.697205 | 0.779041 |
| huggingface_bfloat16_training | T5Small | 0.792645 | 0.779154 |
| huggingface_bfloat16_training | T5ForConditionalGeneration | 0.786390 | 0.780767 |
| huggingface_float16_training | MegatronBertForCausalLM | 0.796475 | 0.782444 |
| huggingface_bfloat16_training | AllenaiLongformerBase | 0.690063 | 0.782644 |
| huggingface_bfloat16_training | OPTForCausalLM | 0.840539 | 0.785099 |
| huggingface_bfloat16_training | LayoutLMForMaskedLM | 0.797611 | 0.788736 |
| huggingface_bfloat16_training | YituTechConvBert | 0.745813 | 0.790626 |
| huggingface_float16_training | OPTForCausalLM | 0.819500 | 0.791187 |
| huggingface_bfloat16_training | MegatronBertForCausalLM | 0.828578 | 0.791889 |
| huggingface_float16_training | LayoutLMForMaskedLM | 0.761733 | 0.812670 |
| huggingface_float16_training | AlbertForMaskedLM | 0.767943 | 0.824460 |
| huggingface_float16_training | AllenaiLongformerBase | 0.677218 | 0.836126 |
| huggingface_float16_training | GPT2ForSequenceClassification | 0.764022 | 0.846931 |
| huggingface_float16_training | XGLMForCausalLM | 0.792153 | 0.846972 |
| huggingface_float16_training | ElectraForCausalLM | 0.785877 | 0.891137 |

  • 🟡 [80%, 90%), may be fluctuations

| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
| --- | --- | --- | --- |
| huggingface_bfloat16_training | ElectraForCausalLM | 0.819469 | 0.819364 |
| huggingface_bfloat16_training | DebertaV2ForMaskedLM | 0.888943 | 0.823332 |
| huggingface_float16_training | BlenderbotForCausalLM | 0.832481 | 0.832124 |
| huggingface_bfloat16_training | M2M100ForConditionalGeneration | 0.887186 | 0.832635 |
| huggingface_float16_training | M2M100ForConditionalGeneration | 0.895787 | 0.838970 |
| huggingface_bfloat16_training | GPT2ForSequenceClassification | 0.811198 | 0.841787 |
| huggingface_bfloat16_training | AlbertForMaskedLM | 0.809017 | 0.844046 |
| huggingface_float16_training | DebertaV2ForMaskedLM | 0.886248 | 0.847962 |
| torchbench_bfloat16_training | mnasnet1_0 | 1.018179 | 0.851450 |
| huggingface_bfloat16_training | BlenderbotForCausalLM | 0.830904 | 0.860472 |
| torchbench_bfloat16_training | resnet18 | 0.991716 | 0.867141 |
| huggingface_float16_training | GoogleFnet | 0.834713 | 0.868807 |
| huggingface_bfloat16_training | GoogleFnet | 0.834979 | 0.869174 |
| huggingface_bfloat16_training | XGLMForCausalLM | 0.866660 | 0.889654 |



Development

Successfully merging this pull request may close these issues.

[distributed] KeyError in test/distributed/test_c10d_xccl.py
[distributed] feature gaps in test/distributed/test_c10d_xccl.py

2 participants