v5.0.x revert reorder and fix ucc via comm self#13816
Open
janjust wants to merge 2 commits intoopen-mpi:v5.0.xfrom
Open
v5.0.x revert reorder and fix ucc via comm self#13816janjust wants to merge 2 commits intoopen-mpi:v5.0.xfrom
janjust wants to merge 2 commits intoopen-mpi:v5.0.xfrom
Conversation
Revert commit 4370cd8 which moved the cid>=3 communicator cleanup loop to before OBJ_DESTRUCT(&ompi_mpi_comm_world). The reordering breaks coll components (e.g. HAN, ACOLL) that create internal sub-communicators on behalf of MPI_COMM_WORLD: those sub-comms have cid >= 3 and carry no EXTRA_RETAIN flag, so the loop frees them before COMM_WORLD is destructed, leaving COMM_WORLD's module destructors with dangling pointers causing use-after-free. Restore the original ordering: COMM_SELF and COMM_WORLD are destructed first, then MPI_COMM_NULL, then the cid>=3 leak-cleanup loop. Signed-off-by: Tomislav Janjusic <[email protected]>
…lize Register a MPI_COMM_SELF attribute delete callback to initiate ucc_team_destroy(COMM_WORLD) at the start of MPI_Finalize, before any communicator is touched. This avoids blocking on collective transports (SHARP, NCCL) that require all ranks before the PMIx barrier. COMM_WORLD's module_destruct completes the spin after the PMIx barrier and calls mca_coll_ucc_finalize_ctx() while COMM_WORLD's group is still valid. This is required because ucc_context_destroy runs a UCP OOB allgather that dereferences comm->c_local_group via ompi_comm_size(); deferring to component close (after ompi_comm_destruct releases the group) causes a NULL dereference crash. Sub-communicator modules are tracked in a component-level opal_pointer_array_t. Each module removes itself on normal destruct; COMM_WORLD's destructor sweeps any remaining entries (leaked communicators) before calling ucc_context_destroy, ensuring all UCC teams are gone before the context is torn down. Signed-off-by: Tomislav Janjusic <[email protected]>
edgargabriel
approved these changes
Apr 13, 2026
bosilca
approved these changes
Apr 13, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
revert the reording of comm world free and fix the ucc teardown with comm_self attribute
bot:notacherrypick