coll/ucc: fix ucc finalize leak on MPI_Finalize#13768
Open
janjust wants to merge 1 commit intoopen-mpi:v6.0.xfrom
Open
coll/ucc: fix ucc finalize leak on MPI_Finalize#13768janjust wants to merge 1 commit intoopen-mpi:v6.0.xfrom
janjust wants to merge 1 commit intoopen-mpi:v6.0.xfrom
Conversation
f16c5b8 to
cc8bdc3
Compare
All UCC team teardown and context/library finalization was driven by a communicator attribute delete callback on MPI_COMM_WORLD. OMPI commit b79004e (v5: 6a581ad) intentionally skips user-defined attribute callbacks on MPI_COMM_WORLD during MPI_Finalize to fix a PETSc deadlock (sec. open-mpi#12035), so ucc_context_destroy / ucc_finalize were never called. Remove the communicator attribute mechanism entirely: - Call ucc_team_destroy() directly from mca_coll_ucc_module_destruct(), which fires for every communicator including MPI_COMM_WORLD regardless of OMPI version. - Add mca_coll_ucc_finalize_ctx() and call it from the MPI_COMM_WORLD module destructor. ompi_comm_destruct() releases coll modules (firing destructors) before releasing c_local_group or calling PML del_comm, so MPI_COMM_WORLD is still fully functional for ucc_context_destroy's OOB allgather at that point. mca_coll_ucc_close() retains the call as an idempotent safety net. Signed-off-by: Tomislav Janjusic <[email protected]> (cherry picked from commit bbc0be5)
cc8bdc3 to
035b634
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
commit b79004e intentionally skips user-defined attribute delete callbacks on MPI_COMM_WORLD during MPI_Finalize to fix an issue where PETSc attribute callbacks invoked MPI after finalize had started (PR #12072 / issue #12035). As a side-effect, UCC's ucc_comm_attr_del_fn is never invoked for MPI_COMM_WORLD, so ucc_context_destroy() and ucc_finalize() are never called.
Fix: register a second attribute on MPI_COMM_SELF (ucc_self_attr_del_fn). Per MPI-4.1 sec. 11.2.4 the MPI_COMM_SELF delete callbacks are guaranteed to fire at the very start of MPI_Finalize, before any teardown. The new callback destroys the MPI_COMM_WORLD UCC team (if ucc_comm_attr_del_fn did not already do so) and then calls ucc_context_destroy/ucc_finalize.
Additional fixes bundled in this patch:
(cherry picked from commit e928049)