Add ProcessGroupXCCL functionality to pass test_c10d_xccl.py by frost-intel · Pull Request #3171 · intel/torch-xpu-ops

frost-intel · 2026-03-24T19:34:05Z

Recently enabled tests at https://github.com/daisyden/pytorch/blob/distributed_2.11/test/distributed/test_c10d_xccl.py show a few gaps in functionality for ProcessGroupXCCL.

Fixes #2967

reset_xccl_trace(): Enables _reset_fr_recording_xccl()
isInitialized(): Enables is_initialized()
Make setEnableNanCheck public
setTimeout(): Enables _set_default_timeout
Enable timing of collectives within FlighRecorder (fixes [distributed] KeyError in test/distributed/test_c10d_xccl.py #2971 )

Copilot

Pull request overview

This PR fills in missing ProcessGroupXCCL backend capabilities needed to satisfy gaps exposed by test/distributed/test_c10d_xccl.py, mainly around initialization hooks, timeout configuration, and FlightRecorder trace/timing support.

Changes:

Add setTimeout() override, isInitialized(), and make setEnableNanCheck() publicly accessible on ProcessGroupXCCL.
Add reset_xccl_trace() to reset XCCL FlightRecorder state.
Enable duration computation for recorded collectives/P2P ops by switching FlightRecorderXCCL::retire_id(..., compute_duration) to true.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`src/xccl/ProcessGroupXCCL.hpp`	Exposes new/previously-nonpublic ProcessGroupXCCL APIs (`setTimeout`, `isInitialized`, `setEnableNanCheck`) and declares `reset_xccl_trace()`.
`src/xccl/ProcessGroupXCCL.cpp`	Implements `reset_xccl_trace()` and `isInitialized()`, updates timeout error throwing, and enables FlightRecorder duration computation on retirement callbacks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-24T19:39:46Z

src/xccl/ProcessGroupXCCL.cpp

+  if (devXCCLCommMap_.empty()) {
+    return false;
+  }
+  // Unlike PGNCCL, all comms are initialized (or we fail)
+  return true;
+}
+


isInitialized() currently reports initialized only after at least one communicator is created (devXCCLCommMap_ non-empty). Since communicators are created lazily (e.g., first collective calls initXCCLComm()), this will return false immediately after ProcessGroupXCCL construction, which is likely not what Python is_initialized()/backend readiness checks expect. Consider tracking an explicit initialized flag (set after constructor/init completes) and/or renaming the method to reflect what it actually checks; also update the comment about comms being initialized up-front (it contradicts the current lazy-init behavior).

Suggested change

if (devXCCLCommMap_.empty()) {

return false;

}

// Unlike PGNCCL, all comms are initialized (or we fail)

return true;

}

// ProcessGroupXCCL is considered initialized once construction succeeds.

// Communicators are created lazily (e.g., on first collective), so

// initialization is not tied to devXCCLCommMap_ being non-empty.

return true;

}

Copilot · 2026-03-24T19:39:47Z

src/xccl/ProcessGroupXCCL.hpp

+    options_->timeout = timeout;
+  }
+
+  bool isInitialized();


isInitialized() is added without override (and without const). If this is intended to implement/override the Backend::isInitialized() virtual used by Python bindings, please match the base signature and mark it override so the compiler can catch mismatches (otherwise the base implementation may still be used).

Suggested change

bool isInitialized();

bool isInitialized() const override;

Copilot · 2026-03-24T19:39:47Z

src/xccl/ProcessGroupXCCL.cpp

  work->future_->addCallback(
      [id, reset_epoch](at::ivalue::Future&) {
        FlightRecorderXCCL::get()->retire_id(
-            id, reset_epoch, /*compute_duration*/ false);
+            id, reset_epoch, /*compute_duration*/ true);
      },


retire_id(..., /*compute_duration*/ true) is now unconditional. When timing is disabled for a WorkXCCL (xcclStartEvent_ is nullptr), this may add avoidable overhead and can lead to inconsistent trace contents (e.g., duration_ms appearing even when timing is off, or duration computation with missing start events depending on FlightRecorder behavior). Consider making compute_duration conditional on the work/timing setting (e.g., capture work->timingEnabled_ or equivalent) so duration is computed only when timing is actually enabled.

Copilot · 2026-03-24T19:39:47Z

src/xccl/ProcessGroupXCCL.cpp

    work->future_->addCallback(
        [id, reset_epoch](at::ivalue::Future&) {
          FlightRecorderXCCL::get()->retire_id(
-              id, reset_epoch, /*compute_duration*/ false);
+              id, reset_epoch, /*compute_duration*/ true);
        },


Same as above for P2P: retire_id(..., /*compute_duration*/ true) is unconditional and should ideally be conditional on timing being enabled for this work to avoid overhead and trace/schema changes when timing is disabled.

github-actions · 2026-03-24T23:09:38Z

Performance outliers, please check!

🔴 [-1, 80%), should be regression

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	Background_Matting	0.763132	0.699989
torchbench_bfloat16_training	alexnet	0.784896	0.754091
torchbench_bfloat16_training	pytorch_unet	0.808368	0.755619
torchbench_bfloat16_training	nvidia_deeprecommender	0.758746	0.761161
torchbench_bfloat16_training	resnet50	0.772198	0.767220
torchbench_bfloat16_training	vgg16	0.771852	0.882418
torchbench_bfloat16_training	shufflenet_v2_x1_0	0.794900	0.889429
torchbench_bfloat16_training	LearningToPaint	0.714729	0.947847

🟡 [80%, 90%), may be fluctuations

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	mobilenet_v2	0.843230	0.839124
torchbench_bfloat16_training	squeezenet1_1	0.831182	1.032032

github-actions · 2026-03-25T21:41:40Z

Performance outliers, please check!

🔴 [-1, 80%), should be regression

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
huggingface_bfloat16_training	RobertaForCausalLM	0.754173	0.720551
huggingface_float16_training	DistilBertForMaskedLM	0.742165	0.736366
huggingface_float16_training	RobertaForCausalLM	0.724203	0.736471
huggingface_float16_training	MBartForCausalLM	0.775283	0.737618
huggingface_float16_training	PegasusForCausalLM	0.729126	0.739225
huggingface_float16_training	TrOCRForCausalLM	0.704423	0.739865
huggingface_float16_training	PLBartForCausalLM	0.746446	0.745405
huggingface_float16_training	BartForCausalLM	0.772044	0.746323
huggingface_bfloat16_training	DistilBertForMaskedLM	0.780531	0.746984
huggingface_bfloat16_training	MBartForCausalLM	0.799427	0.753490
huggingface_bfloat16_training	PLBartForCausalLM	0.782411	0.758965
huggingface_bfloat16_training	TrOCRForCausalLM	0.767763	0.760552
huggingface_bfloat16_training	BertForMaskedLM	0.774522	0.763210
huggingface_bfloat16_training	XLNetLMHeadModel	0.737075	0.764225
huggingface_float16_training	DistillGPT2	0.719704	0.765206
huggingface_float16_training	T5ForConditionalGeneration	0.759213	0.766725
huggingface_bfloat16_training	BartForCausalLM	0.827395	0.768687
huggingface_bfloat16_training	DistillGPT2	0.773987	0.768755
huggingface_float16_training	T5Small	0.758678	0.773184
huggingface_bfloat16_training	PegasusForCausalLM	0.777517	0.773254
huggingface_float16_training	YituTechConvBert	0.716842	0.775056
huggingface_float16_training	BertForMaskedLM	0.747784	0.777874
huggingface_float16_training	XLNetLMHeadModel	0.697205	0.779041
huggingface_bfloat16_training	T5Small	0.792645	0.779154
huggingface_bfloat16_training	T5ForConditionalGeneration	0.786390	0.780767
huggingface_float16_training	MegatronBertForCausalLM	0.796475	0.782444
huggingface_bfloat16_training	AllenaiLongformerBase	0.690063	0.782644
huggingface_bfloat16_training	OPTForCausalLM	0.840539	0.785099
huggingface_bfloat16_training	LayoutLMForMaskedLM	0.797611	0.788736
huggingface_bfloat16_training	YituTechConvBert	0.745813	0.790626
huggingface_float16_training	OPTForCausalLM	0.819500	0.791187
huggingface_bfloat16_training	MegatronBertForCausalLM	0.828578	0.791889
huggingface_float16_training	LayoutLMForMaskedLM	0.761733	0.812670
huggingface_float16_training	AlbertForMaskedLM	0.767943	0.824460
huggingface_float16_training	AllenaiLongformerBase	0.677218	0.836126
huggingface_float16_training	GPT2ForSequenceClassification	0.764022	0.846931
huggingface_float16_training	XGLMForCausalLM	0.792153	0.846972
huggingface_float16_training	ElectraForCausalLM	0.785877	0.891137

🟡 [80%, 90%), may be fluctuations

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
huggingface_bfloat16_training	ElectraForCausalLM	0.819469	0.819364
huggingface_bfloat16_training	DebertaV2ForMaskedLM	0.888943	0.823332
huggingface_float16_training	BlenderbotForCausalLM	0.832481	0.832124
huggingface_bfloat16_training	M2M100ForConditionalGeneration	0.887186	0.832635
huggingface_float16_training	M2M100ForConditionalGeneration	0.895787	0.838970
huggingface_bfloat16_training	GPT2ForSequenceClassification	0.811198	0.841787
huggingface_bfloat16_training	AlbertForMaskedLM	0.809017	0.844046
huggingface_float16_training	DebertaV2ForMaskedLM	0.886248	0.847962
torchbench_bfloat16_training	mnasnet1_0	1.018179	0.851450
huggingface_bfloat16_training	BlenderbotForCausalLM	0.830904	0.860472
torchbench_bfloat16_training	resnet18	0.991716	0.867141
huggingface_float16_training	GoogleFnet	0.834713	0.868807
huggingface_bfloat16_training	GoogleFnet	0.834979	0.869174
huggingface_bfloat16_training	XGLMForCausalLM	0.866660	0.889654

frost-intel added 3 commits March 24, 2026 10:50

Add timeout, isinitialized funcs for tests

22ed75f

Add fr reset

cdf5b2e

Enable timing for traces

fa4c9c2

Copilot AI review requested due to automatic review settings March 24, 2026 19:34

Copilot started reviewing on behalf of frost-intel March 24, 2026 19:34 View session

Copilot AI reviewed Mar 24, 2026

View reviewed changes

Merge branch 'main' into frost/test_c10d_xccl_fixes

b8fe30f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ProcessGroupXCCL functionality to pass test_c10d_xccl.py#3171

Add ProcessGroupXCCL functionality to pass test_c10d_xccl.py#3171
frost-intel wants to merge 4 commits intomainfrom
frost/test_c10d_xccl_fixes

frost-intel commented Mar 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

github-actions bot commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

frost-intel commented Mar 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Mar 24, 2026

Performance outliers, please check!

Uh oh!

github-actions bot commented Mar 25, 2026

Performance outliers, please check!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants