Cherrypick comm ownership ompiv6.0.x#13818
Open
mshanthagit wants to merge 2 commits intoopen-mpi:v6.0.xfrom
Open
Cherrypick comm ownership ompiv6.0.x#13818mshanthagit wants to merge 2 commits intoopen-mpi:v6.0.xfrom
mshanthagit wants to merge 2 commits intoopen-mpi:v6.0.xfrom
Conversation
…ership The OMPI_COMM_EXTRA_RETAIN flag was a workaround for a finalize-time ordering problem: during MPI_Finalize, the communicator table is walked in ascending CID order, and each communicator is OBJ_RELEASE'd. If a sub-communicator (e.g., c_local_comm of an inter-communicator, or a HAN sub-communicator) had a lower CID than its parent, it would be destroyed first, leaving the parent with a dangling pointer. EXTRA_RETAIN solved this by conditionally bumping the refcount (only when the sub-comm CID was lower than the parent's), setting a flag on the communicator, and adding matching release logic in ompi_comm_free and special-case handling in the finalize loop. This made the ownership model implicit and spread across multiple files. Replace this with explicit reference-counted ownership: - In ompi_comm_set_nb(), OBJ_RETAIN c_local_comm when it is assigned to the parent inter-communicator, regardless of CID ordering. - In ompi_comm_destruct(), OBJ_RELEASE c_local_comm if the parent is an inter-communicator. This ensures proper cleanup in both the finalize path (where the destructor cascades the release) and the user path (where ompi_comm_free NULLs the field before the destructor runs). - In ompi_comm_free(), release the ownership reference on c_local_comm (OBJ_RELEASE) before calling ompi_comm_free() on it recursively. The first release undoes the ownership retain; the recursive free handles attribute cleanup and releases the creation reference. - Simplify the finalize loop in ompi_comm_finalize(): a single OBJ_RELEASE per communicator is sufficient. Sub-communicators retained by their parent survive until the parent's destructor cascades the release, regardless of CID ordering. - Remove the OMPI_COMM_EXTRA_RETAIN flag, OMPI_COMM_IS_EXTRA_RETAIN(), and OMPI_COMM_SET_EXTRA_RETAIN() from communicator.h. - Remove the conditional EXTRA_RETAIN block from ompi_comm_activate_complete() in comm_cid.c. For the HAN collective module: - Replace the HAN_SUBCOM_EXTRA_RETAIN macro (conditional OBJ_RETAIN + flag set) with unconditional OBJ_RETAIN on sub-communicators. - In the HAN module destructor, after ompi_comm_free() on each sub-communicator, use ompi_comm_lookup() to check whether the sub-comm survived (refcount > 0), and if so, OBJ_RELEASE it. This safely handles both the user path (sub-comm survives ompi_comm_free with refcount 1, lookup finds it, OBJ_RELEASE destroys it) and the finalize path (sub-comm is destroyed by ompi_comm_free, lookup returns NULL). Signed-off-by: George Bosilca <[email protected]>
The ACOLL module creates many internal sub-communicators (local_comm, socket_comm, subgrp_comm, numa_comm, leader_comm, socket_ldr_comm, numa_comm_ldrs, local_r_comm, base_comm[][], split_comm[]) but never held ownership references on them. During MPI_Finalize, if a sub-comm has a lower CID than its parent, it could be released before the module's destructor runs, leading to use-after-free. Apply the same ownership model introduced for HAN in commit 9336f51cb2: - OBJ_RETAIN every sub-communicator immediately after creation. - Add a coll_acoll_subcomm_free() helper that saves the CID, calls ompi_comm_free (attribute cleanup + one OBJ_RELEASE), then uses ompi_comm_lookup to safely release the ownership reference. - Use this helper in both cleanup paths: the module destructor in coll_acoll_component.c and the re-initialization path in coll_acoll_utils.h. - Fix a pre-existing leak: numa_comm and numa_comm_ldrs were never freed in the module destructor; they are now. Signed-off-by: George Bosilca <[email protected]>
Contributor
Author
|
Cherry-pick from the main branch. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.