Add ProcessGroupXCCL functionality to pass test_c10d_xccl.py#3171
Add ProcessGroupXCCL functionality to pass test_c10d_xccl.py#3171frost-intel wants to merge 4 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fills in missing ProcessGroupXCCL backend capabilities needed to satisfy gaps exposed by test/distributed/test_c10d_xccl.py, mainly around initialization hooks, timeout configuration, and FlightRecorder trace/timing support.
Changes:
- Add
setTimeout()override,isInitialized(), and makesetEnableNanCheck()publicly accessible onProcessGroupXCCL. - Add
reset_xccl_trace()to reset XCCL FlightRecorder state. - Enable duration computation for recorded collectives/P2P ops by switching
FlightRecorderXCCL::retire_id(..., compute_duration)totrue.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
src/xccl/ProcessGroupXCCL.hpp |
Exposes new/previously-nonpublic ProcessGroupXCCL APIs (setTimeout, isInitialized, setEnableNanCheck) and declares reset_xccl_trace(). |
src/xccl/ProcessGroupXCCL.cpp |
Implements reset_xccl_trace() and isInitialized(), updates timeout error throwing, and enables FlightRecorder duration computation on retirement callbacks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (devXCCLCommMap_.empty()) { | ||
| return false; | ||
| } | ||
| // Unlike PGNCCL, all comms are initialized (or we fail) | ||
| return true; | ||
| } | ||
|
|
There was a problem hiding this comment.
isInitialized() currently reports initialized only after at least one communicator is created (devXCCLCommMap_ non-empty). Since communicators are created lazily (e.g., first collective calls initXCCLComm()), this will return false immediately after ProcessGroupXCCL construction, which is likely not what Python is_initialized()/backend readiness checks expect. Consider tracking an explicit initialized flag (set after constructor/init completes) and/or renaming the method to reflect what it actually checks; also update the comment about comms being initialized up-front (it contradicts the current lazy-init behavior).
| if (devXCCLCommMap_.empty()) { | |
| return false; | |
| } | |
| // Unlike PGNCCL, all comms are initialized (or we fail) | |
| return true; | |
| } | |
| // ProcessGroupXCCL is considered initialized once construction succeeds. | |
| // Communicators are created lazily (e.g., on first collective), so | |
| // initialization is not tied to devXCCLCommMap_ being non-empty. | |
| return true; | |
| } |
| options_->timeout = timeout; | ||
| } | ||
|
|
||
| bool isInitialized(); |
There was a problem hiding this comment.
isInitialized() is added without override (and without const). If this is intended to implement/override the Backend::isInitialized() virtual used by Python bindings, please match the base signature and mark it override so the compiler can catch mismatches (otherwise the base implementation may still be used).
| bool isInitialized(); | |
| bool isInitialized() const override; |
| work->future_->addCallback( | ||
| [id, reset_epoch](at::ivalue::Future&) { | ||
| FlightRecorderXCCL::get()->retire_id( | ||
| id, reset_epoch, /*compute_duration*/ false); | ||
| id, reset_epoch, /*compute_duration*/ true); | ||
| }, |
There was a problem hiding this comment.
retire_id(..., /*compute_duration*/ true) is now unconditional. When timing is disabled for a WorkXCCL (xcclStartEvent_ is nullptr), this may add avoidable overhead and can lead to inconsistent trace contents (e.g., duration_ms appearing even when timing is off, or duration computation with missing start events depending on FlightRecorder behavior). Consider making compute_duration conditional on the work/timing setting (e.g., capture work->timingEnabled_ or equivalent) so duration is computed only when timing is actually enabled.
| work->future_->addCallback( | ||
| [id, reset_epoch](at::ivalue::Future&) { | ||
| FlightRecorderXCCL::get()->retire_id( | ||
| id, reset_epoch, /*compute_duration*/ false); | ||
| id, reset_epoch, /*compute_duration*/ true); | ||
| }, |
There was a problem hiding this comment.
Same as above for P2P: retire_id(..., /*compute_duration*/ true) is unconditional and should ideally be conditional on timing being enabled for this work to avoid overhead and trace/schema changes when timing is disabled.
Performance outliers, please check!
|
Performance outliers, please check!
|
Recently enabled tests at https://github.com/daisyden/pytorch/blob/distributed_2.11/test/distributed/test_c10d_xccl.py show a few gaps in functionality for ProcessGroupXCCL.
Fixes #2967
reset_xccl_trace(): Enables_reset_fr_recording_xccl()isInitialized(): Enablesis_initialized()setEnableNanCheckpublicsetTimeout(): Enables_set_default_timeout