coll/ucc: fix ucc resource leak on MPI_Finalize #13767

Open
janjust wants to merge 1 commit into open-mpi:main from janjust:main-ucc-sharp-resource-leak-fix

Conversation

janjust (Contributor) commented Mar 17, 2026

All UCC team teardown and context/library finalization were driven by a
communicator attribute delete callback on MPI_COMM_WORLD. OMPI commit
b79004e (v5: 6a581ad) intentionally skips user-defined attribute
callbacks on MPI_COMM_WORLD during MPI_Finalize to fix a PETSc deadlock
(see #12035), so ucc_context_destroy / ucc_finalize were never called.
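
To make the failure mode concrete, here is a toy model in plain C. It is not OMPI code: `comm_t`, `comm_free_attrs`, and `ucc_cleanup_cb` are hypothetical stand-ins, and the skip rule is reduced to a single flag check. It only illustrates how cleanup that lives in an attribute delete callback is silently lost once the finalize path stops invoking callbacks on MPI_COMM_WORLD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an attribute delete callback. */
typedef void (*delete_cb_t)(void *attr_val);

/* Hypothetical, heavily simplified communicator: one attribute at most. */
typedef struct {
    bool        is_world;   /* true for the MPI_COMM_WORLD stand-in */
    delete_cb_t delete_cb;  /* cleanup registered as a delete callback */
    void       *attr_val;   /* value handed to the callback */
} comm_t;

/* Stand-in for the UCC cleanup: clears a "context is live" flag. */
void ucc_cleanup_cb(void *attr_val)
{
    *(bool *)attr_val = false;
}

/* Models attribute teardown. During finalize, callbacks on the world
 * communicator are skipped (assumed simplification of the behavior
 * described above), so any cleanup registered there never runs. */
void comm_free_attrs(comm_t *comm, bool in_finalize)
{
    assert(comm != NULL);
    if (comm->delete_cb == NULL) {
        return;                       /* nothing registered */
    }
    if (in_finalize && comm->is_world) {
        return;                       /* callback skipped: cleanup is lost */
    }
    comm->delete_cb(comm->attr_val);
    comm->delete_cb = NULL;           /* run at most once */
}
```

In this model, calling `comm_free_attrs()` on the world stand-in with `in_finalize` set leaves the flag untouched, which corresponds to the context that was never destroyed.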

Remove the communicator attribute mechanism entirely:

- Call ucc_team_destroy() directly from mca_coll_ucc_module_destruct(),
  which fires for every communicator including MPI_COMM_WORLD regardless
  of OMPI version.
- Add mca_coll_ucc_finalize_ctx() and call it from the MPI_COMM_WORLD
  module destructor. ompi_comm_destruct() releases coll modules (firing
  destructors) before releasing c_local_group or calling PML del_comm,
  so MPI_COMM_WORLD is still fully functional for ucc_context_destroy's
  OOB allgather at that point. mca_coll_ucc_close() retains the call as
  an idempotent safety net.
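
The control flow described above can be sketched as a small standalone model. This is not the patch itself: `ucc_state_t`, `module_t`, `finalize_ctx`, `module_destruct`, and `component_close` are hypothetical names that stand in for the UCC context state, the coll module, mca_coll_ucc_finalize_ctx(), mca_coll_ucc_module_destruct(), and mca_coll_ucc_close() respectively, and the real UCC calls are replaced by counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the component-wide UCC state. */
typedef struct {
    bool ctx_live;      /* models a live UCC context and library handle */
    int  ctx_destroys;  /* number of real context teardowns performed */
    int  teams_live;    /* number of teams not yet destroyed */
} ucc_state_t;

/* Hypothetical stand-in for a per-communicator coll module. */
typedef struct {
    ucc_state_t *state;
    bool         is_world;  /* module belongs to the world communicator */
    bool         has_team;  /* module owns a team */
} module_t;

/* Idempotent context teardown: only the first call does any work. */
void finalize_ctx(ucc_state_t *state)
{
    assert(state != NULL);
    if (!state->ctx_live) {
        return;                 /* already finalized: no-op */
    }
    state->ctx_live = false;    /* stands in for context destroy + finalize */
    state->ctx_destroys++;
}

/* Module destructor: destroys the team for every communicator, and
 * additionally finalizes the context when the module is the world's. */
void module_destruct(module_t *module)
{
    assert(module != NULL && module->state != NULL);
    if (module->has_team) {
        module->state->teams_live--;   /* stands in for team destroy */
        module->has_team = false;
    }
    if (module->is_world) {
        finalize_ctx(module->state);
    }
}

/* Component close: repeats the call as a safety net. */
void component_close(ucc_state_t *state)
{
    finalize_ctx(state);
}
```

The guard in `finalize_ctx()` is what makes the later call from the close path harmless: whichever caller runs first performs the teardown, and the other becomes a no-op.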

janjust requested a review from Sergei-Lebedev March 17, 2026 00:20
janjust requested review from bosilca and removed request for Sergei-Lebedev March 17, 2026 14:32
janjust force-pushed the main-ucc-sharp-resource-leak-fix branch from e928049 to 3bf61b2 March 17, 2026 21:30
janjust changed the title from "coll/ucc: register cleanup on MPI_COMM_SELF to fix SHARP resource leak." to "coll/ucc: fix ucc finalize leak on MPI_Finalize" Mar 17, 2026
janjust changed the title from "coll/ucc: fix ucc finalize leak on MPI_Finalize" to "coll/ucc: fix ucc resource leak on MPI_Finalize" Mar 17, 2026
All UCC team teardown and context/library finalization were driven by a
communicator attribute delete callback on MPI_COMM_WORLD. OMPI commit
b79004e (v5: 6a581ad) intentionally skips user-defined attribute
callbacks on MPI_COMM_WORLD during MPI_Finalize to fix a PETSc deadlock
(see open-mpi#12035), so ucc_context_destroy / ucc_finalize were never called.

Remove the communicator attribute mechanism entirely:

- Call ucc_team_destroy() directly from mca_coll_ucc_module_destruct(),
  which fires for every communicator including MPI_COMM_WORLD regardless
  of OMPI version.
- Add mca_coll_ucc_finalize_ctx() and call it from the MPI_COMM_WORLD
  module destructor. ompi_comm_destruct() releases coll modules (firing
  destructors) before releasing c_local_group or calling PML del_comm,
  so MPI_COMM_WORLD is still fully functional for ucc_context_destroy's
  OOB allgather at that point. mca_coll_ucc_close() retains the call as
  an idempotent safety net.

Signed-off-by: Tomislav Janjusic <[email protected]>