coll/ucc: fix ucc finalize leak on MPI_Finalize #13768

Open
janjust wants to merge 1 commit into open-mpi:v6.0.x from janjust:v6.0.x-ucc-sharp-resource-leak-fix
@janjust janjust commented Mar 17, 2026

Commit b79004e intentionally skips user-defined attribute delete callbacks on MPI_COMM_WORLD during MPI_Finalize. It did so to fix an issue where PETSc attribute callbacks invoked MPI after finalize had started (PR #12072 / issue #12035). As a side effect, UCC's ucc_comm_attr_del_fn is never invoked for MPI_COMM_WORLD, so ucc_context_destroy() and ucc_finalize() are never called.

Fix: register a second attribute on MPI_COMM_SELF whose delete callback is ucc_self_attr_del_fn. Per MPI-4.1 sec. 11.2.4, the MPI_COMM_SELF delete callbacks are guaranteed to fire at the very start of MPI_Finalize, before any teardown. The new callback destroys the MPI_COMM_WORLD UCC team (if ucc_comm_attr_del_fn did not already do so) and then calls ucc_context_destroy() and ucc_finalize().

Additional fixes bundled in this patch:

  • ucc_comm_attr_del_fn: add libucc_initialized guard so the callback is a safe no-op if invoked after the context has been torn down.

(cherry picked from commit e928049)
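
For readers following the description above, here is a rough plain-C model of how the two delete callbacks and the `libucc_initialized` guard could interact. It is only a sketch under stated assumptions: it does not call MPI or UCC, every `model_*` function and counter is a hypothetical stand-in, and the real callbacks would be registered through `MPI_Comm_create_keyval` and `MPI_Comm_set_attr` rather than called directly. It also assumes the MPI_COMM_WORLD callback only destroys the team.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state standing in for the real UCC library, context and team. */
static bool libucc_initialized = false;   /* guard flag named in the PR */
static bool world_team_alive   = false;
static int  team_destroy_calls = 0;
static int  ctx_finalize_calls = 0;

/* Hypothetical setup: pretend UCC was initialized and a world team exists. */
static void model_init(void)
{
    libucc_initialized = true;
    world_team_alive   = true;
    team_destroy_calls = 0;
    ctx_finalize_calls = 0;
}

/* Destroy the MPI_COMM_WORLD team at most once. */
static void model_destroy_world_team(void)
{
    if (world_team_alive) {
        world_team_alive = false;
        team_destroy_calls++;
    }
}

/* Models ucc_comm_attr_del_fn on MPI_COMM_WORLD: a no-op once the
 * context has been torn down, thanks to the libucc_initialized guard. */
static void model_world_attr_del(void)
{
    if (!libucc_initialized) {
        return;
    }
    model_destroy_world_team();
}

/* Models ucc_self_attr_del_fn on MPI_COMM_SELF: destroy the team if the
 * world callback has not, then tear down the context and the library. */
static void model_self_attr_del(void)
{
    if (!libucc_initialized) {
        return;
    }
    model_destroy_world_team();
    ctx_finalize_calls++;        /* stands in for ucc_context_destroy + ucc_finalize */
    libucc_initialized = false;
    assert(!world_team_alive);   /* the team never outlives the context */
}
```

In this model, calling `model_self_attr_del()` and `model_world_attr_del()` in either order after `model_init()` leaves both counters at 1, which is the property the guard is meant to provide.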

@github-actions github-actions bot added this to the v6.0.0 milestone Mar 17, 2026
@janjust janjust requested review from bosilca and removed request for Sergei-Lebedev March 17, 2026 14:32
@janjust janjust force-pushed the v6.0.x-ucc-sharp-resource-leak-fix branch 2 times, most recently from f16c5b8 to cc8bdc3 Compare March 17, 2026 21:44
All UCC team teardown and context/library finalization was driven by a
communicator attribute delete callback on MPI_COMM_WORLD. OMPI commit
b79004e (v5: 6a581ad) intentionally skips user-defined attribute
callbacks on MPI_COMM_WORLD during MPI_Finalize to fix a PETSc deadlock
(see open-mpi#12035), so ucc_context_destroy / ucc_finalize were never called.

Remove the communicator attribute mechanism entirely:

- Call ucc_team_destroy() directly from mca_coll_ucc_module_destruct(),
  which fires for every communicator including MPI_COMM_WORLD regardless
  of OMPI version.
- Add mca_coll_ucc_finalize_ctx() and call it from the MPI_COMM_WORLD
  module destructor. ompi_comm_destruct() releases coll modules (firing
  destructors) before releasing c_local_group or calling PML del_comm,
  so MPI_COMM_WORLD is still fully functional for ucc_context_destroy's
  OOB allgather at that point. mca_coll_ucc_close() retains the call as
  an idempotent safety net.

Signed-off-by: Tomislav Janjusic <[email protected]>
(cherry picked from commit bbc0be5)
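
The destructor-driven teardown in the commit message above can be sketched in plain C as follows. This is a toy model, not the actual patch: the `model_*` names, the struct, and the counters are hypothetical, and the UCC calls are replaced by counter increments. It illustrates two points from the message: every module destructor destroys its own team, and the context finalizer is idempotent, so the later call from the component close path does nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-communicator module state. */
typedef struct {
    bool is_world;   /* true for the MPI_COMM_WORLD module */
    bool has_team;   /* true while a UCC team is attached  */
} model_module_t;

static bool model_ctx_alive          = false;
static int  model_ctx_finalize_calls = 0;
static int  model_team_destroy_calls = 0;

/* Hypothetical setup: pretend the UCC context has been created. */
static void model_ctx_init(void)
{
    model_ctx_alive          = true;
    model_ctx_finalize_calls = 0;
    model_team_destroy_calls = 0;
}

/* Modeled on mca_coll_ucc_finalize_ctx(): safe to call more than once. */
static void model_finalize_ctx(void)
{
    if (!model_ctx_alive) {
        return;                      /* already torn down: no-op */
    }
    model_ctx_alive = false;
    model_ctx_finalize_calls++;      /* stands in for ucc_context_destroy + ucc_finalize */
}

/* Modeled on mca_coll_ucc_module_destruct(): runs for every communicator. */
static void model_module_destruct(model_module_t *m)
{
    assert(m != NULL);
    if (m->has_team) {
        m->has_team = false;
        model_team_destroy_calls++;  /* stands in for ucc_team_destroy */
    }
    if (m->is_world) {
        model_finalize_ctx();        /* the world module finalizes the context */
    }
}

/* Modeled on mca_coll_ucc_close(): retained as an idempotent safety net. */
static void model_component_close(void)
{
    model_finalize_ctx();
}
```

Destructing a non-world module first leaves the context alive. Destructing the world module then finalizes it once, and a later `model_component_close()` changes nothing.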
@janjust janjust force-pushed the v6.0.x-ucc-sharp-resource-leak-fix branch from cc8bdc3 to 035b634 Compare March 17, 2026 21:51
@janjust janjust changed the title coll/ucc: register cleanup on MPI_COMM_SELF to fix SHARP resource leak coll/ucc: fix ucc finalize leak on MPI_Finalize Mar 17, 2026