v5.0.x revert reorder and fix ucc via comm self #13816

Open
janjust wants to merge 2 commits into open-mpi:v5.0.x from
janjust:v5.0.x-revert-reorder-and-fix-ucc-via-comm-self

@janjust janjust commented Apr 9, 2026

Revert the reordering of the COMM_WORLD free and fix the UCC teardown with a COMM_SELF attribute.

bot:notacherrypick

janjust added 2 commits April 9, 2026 17:24
Revert commit 4370cd8 which moved the cid>=3 communicator cleanup
loop to before OBJ_DESTRUCT(&ompi_mpi_comm_world).

The reordering breaks coll components (e.g. HAN, ACOLL) that create
internal sub-communicators on behalf of MPI_COMM_WORLD: those sub-comms
have cid >= 3 and carry no EXTRA_RETAIN flag, so the loop frees them
before COMM_WORLD is destructed, leaving COMM_WORLD's module destructors
with dangling pointers and causing a use-after-free.

Restore the original ordering: COMM_SELF and COMM_WORLD are destructed
first, then MPI_COMM_NULL, then the cid>=3 leak-cleanup loop.
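
The ordering hazard described above can be illustrated with a toy model. This is a minimal sketch, not Open MPI code: `toy_comm_t`, `toy_world_t`, `cleanup_loop`, `destruct_world`, and `teardown_is_safe` are hypothetical names, and a `freed` flag stands in for real deallocation so the bad order can be observed without actually touching freed memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical, not Open MPI types). */
typedef struct toy_comm {
    int  cid;    /* communicator id; internal sub-comms have cid >= 3 */
    bool freed;  /* set instead of calling free() so the bug is observable */
} toy_comm_t;

typedef struct toy_world {
    toy_comm_t *internal_sub;  /* sub-comm created on behalf of the world */
} toy_world_t;

/* Leak-cleanup loop: releases every still-live communicator with cid >= 3. */
static void cleanup_loop(toy_comm_t *comms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (comms[i].cid >= 3 && !comms[i].freed) {
            comms[i].freed = true;
        }
    }
}

/* World destructor: the coll module tears down its own sub-communicator.
 * Returns false if the sub-communicator is already gone, which in real
 * code would be a use-after-free. */
static bool destruct_world(toy_world_t *w)
{
    if (w->internal_sub->freed) {
        return false;
    }
    w->internal_sub->freed = true;
    return true;
}

/* Runs teardown in one of the two orders and reports whether it was safe. */
bool teardown_is_safe(bool cleanup_before_world)
{
    toy_comm_t comms[2] = { { .cid = 3, .freed = false },
                            { .cid = 4, .freed = false } };
    toy_world_t world = { .internal_sub = &comms[0] };
    bool ok;

    if (cleanup_before_world) {
        cleanup_loop(comms, 2);       /* reordered: sub-comm released first */
        ok = destruct_world(&world);  /* module now sees a dangling pointer */
    } else {
        ok = destruct_world(&world);  /* original: module frees its sub-comm */
        cleanup_loop(comms, 2);       /* loop only mops up real leaks */
    }
    return ok;
}
```

In this model `teardown_is_safe(false)` (the restored ordering) succeeds, while `teardown_is_safe(true)` (the reverted ordering) reports the dangling access.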

Signed-off-by: Tomislav Janjusic <[email protected]>
…lize

Register an MPI_COMM_SELF attribute delete callback to initiate
ucc_team_destroy(COMM_WORLD) at the start of MPI_Finalize, before any
communicator is touched.  This avoids blocking on collective transports
(SHARP, NCCL) that require all ranks before the PMIx barrier.

COMM_WORLD's module_destruct completes the ucc_team_destroy spin after
the PMIx barrier and calls mca_coll_ucc_finalize_ctx() while COMM_WORLD's
group is still
valid.  This is required because ucc_context_destroy runs a UCP OOB
allgather that dereferences comm->c_local_group via ompi_comm_size();
deferring to component close (after ompi_comm_destruct releases the
group) causes a NULL dereference crash.
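
The two-phase teardown above can be sketched as follows. This is a toy model under stated assumptions, not the coll/ucc implementation: every `toy_*` name, `phase1_start`, `phase2_finish`, and `run_teardown` are hypothetical, the destroy call is assumed to be a poll-until-done operation (as the "spin" wording suggests), and an `int *group` stands in for `c_local_group`.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_OK, TOY_INPROGRESS, TOY_ERR } toy_status_t;

typedef struct toy_team {
    int  polls_left;       /* polls remaining before the destroy completes */
    bool destroy_started;
} toy_team_t;

typedef struct toy_comm {
    int       *group;      /* stand-in for c_local_group; NULL once released */
    toy_team_t team;
} toy_comm_t;

/* Nonblocking destroy: each call makes progress and reports its status. */
static toy_status_t toy_team_destroy(toy_team_t *t)
{
    t->destroy_started = true;
    if (t->polls_left > 0) {
        t->polls_left--;
        return TOY_INPROGRESS;
    }
    return TOY_OK;
}

/* Context destroy needs the group (models the OOB allgather asking for
 * the communicator size); fails if the group was already released. */
static toy_status_t toy_context_destroy(const toy_comm_t *c)
{
    return (NULL != c->group) ? TOY_OK : TOY_ERR;
}

/* Phase 1 (COMM_SELF delete callback): initiate only, never spin. */
void phase1_start(toy_comm_t *world)
{
    (void) toy_team_destroy(&world->team);
}

/* Phase 2 (world module destructor): finish the spin, then destroy the
 * context while the group is still expected to be valid. */
toy_status_t phase2_finish(toy_comm_t *world)
{
    while (TOY_INPROGRESS == toy_team_destroy(&world->team)) {
        /* a real implementation would drive progress here */
    }
    return toy_context_destroy(world);
}

/* Runs the sequence; release_group_first models deferring context
 * destruction until after the communicator has released its group. */
toy_status_t run_teardown(bool release_group_first)
{
    int group_size = 4;
    toy_comm_t world = { .group = &group_size,
                         .team  = { .polls_left = 3, .destroy_started = false } };

    phase1_start(&world);
    /* the PMIx barrier would sit here */
    if (release_group_first) {
        world.group = NULL;
    }
    toy_status_t rc = phase2_finish(&world);
    world.group = NULL;    /* the communicator destructor releases the group */
    return rc;
}
```

In this model `run_teardown(false)` returns `TOY_OK`, while `run_teardown(true)` returns `TOY_ERR`, the toy equivalent of the NULL dereference described above.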

Sub-communicator modules are tracked in a component-level
opal_pointer_array_t.  Each module removes itself on normal destruct;
COMM_WORLD's destructor sweeps any remaining entries (leaked
communicators) before calling ucc_context_destroy, ensuring all UCC
teams are gone before the context is torn down.
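
The tracking-and-sweep scheme above can be sketched with a fixed-size table. This is an illustration only: a plain array stands in for `opal_pointer_array_t`, and `toy_track`, `toy_module_destruct`, `toy_world_destruct`, and `demo_sweep` are hypothetical names rather than functions from the patch.

```c
#include <stddef.h>

#define TOY_MAX_MODULES 8

typedef struct toy_module {
    int slot;        /* index in the tracking table, -1 if untracked */
    int team_alive;  /* 1 while the module's team still exists */
} toy_module_t;

typedef struct toy_component {
    toy_module_t *tracked[TOY_MAX_MODULES];  /* component-level table */
    int           context_alive;
} toy_component_t;

/* Registers a module in the first free slot; returns the slot or -1. */
int toy_track(toy_component_t *c, toy_module_t *m)
{
    m->team_alive = 1;
    m->slot = -1;
    for (int i = 0; i < TOY_MAX_MODULES; i++) {
        if (NULL == c->tracked[i]) {
            c->tracked[i] = m;
            m->slot = i;
            return i;
        }
    }
    return -1;
}

/* Normal destruct: destroy the team and remove the module from the table. */
void toy_module_destruct(toy_component_t *c, toy_module_t *m)
{
    m->team_alive = 0;
    if (m->slot >= 0) {
        c->tracked[m->slot] = NULL;
        m->slot = -1;
    }
}

/* World destruct: sweep whatever is still tracked (leaked communicators),
 * then tear down the context. Returns the number of swept modules. */
int toy_world_destruct(toy_component_t *c)
{
    int swept = 0;
    for (int i = 0; i < TOY_MAX_MODULES; i++) {
        if (NULL != c->tracked[i]) {
            toy_module_destruct(c, c->tracked[i]);
            swept++;
        }
    }
    c->context_alive = 0;  /* safe: no team outlives the context */
    return swept;
}

/* Two modules destruct normally and one is leaked; returns the sweep count. */
int demo_sweep(void)
{
    toy_component_t comp = { .tracked = { NULL }, .context_alive = 1 };
    toy_module_t a, b, leaked;

    toy_track(&comp, &a);
    toy_track(&comp, &b);
    toy_track(&comp, &leaked);
    toy_module_destruct(&comp, &a);
    toy_module_destruct(&comp, &b);
    return toy_world_destruct(&comp);
}
```

Here `demo_sweep()` returns 1: only the leaked module is left for the world destructor to clean up before the context goes away.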

Signed-off-by: Tomislav Janjusic <[email protected]>
@github-actions github-actions bot added this to the v5.0.10 milestone Apr 9, 2026
@janjust janjust requested a review from bosilca April 9, 2026 22:34
@janjust janjust changed the title V5.0.x revert reorder and fix ucc via comm self v5.0.x revert reorder and fix ucc via comm self Apr 9, 2026
3 participants