Skip to content

Cherrypick comm ownership ompiv6.0.x#13818

Open
mshanthagit wants to merge 2 commits intoopen-mpi:v6.0.xfrom
mshanthagit:cherrypick-comm-ownership-ompiv6.0.x
Open

Cherrypick comm ownership ompiv6.0.x#13818
mshanthagit wants to merge 2 commits intoopen-mpi:v6.0.xfrom
mshanthagit:cherrypick-comm-ownership-ompiv6.0.x

Conversation

@mshanthagit
Copy link
Copy Markdown
Contributor

No description provided.

bosilca added 2 commits April 13, 2026 10:41
…ership

The OMPI_COMM_EXTRA_RETAIN flag was a workaround for a finalize-time
ordering problem: during MPI_Finalize, the communicator table is walked
in ascending CID order, and each communicator is OBJ_RELEASE'd. If a
sub-communicator (e.g., c_local_comm of an inter-communicator, or a HAN
sub-communicator) had a lower CID than its parent, it would be destroyed
first, leaving the parent with a dangling pointer.
EXTRA_RETAIN solved this by conditionally bumping the refcount (only
when the sub-comm CID was lower than the parent's), setting a flag on
the communicator, and adding matching release logic in ompi_comm_free
and special-case handling in the finalize loop. This made the ownership
model implicit and spread across multiple files.
Replace this with explicit reference-counted ownership:
- In ompi_comm_set_nb(), OBJ_RETAIN c_local_comm when it is assigned to
  the parent inter-communicator, regardless of CID ordering.
- In ompi_comm_destruct(), OBJ_RELEASE c_local_comm if the parent is an
  inter-communicator. This ensures proper cleanup in both the finalize
  path (where the destructor cascades the release) and the user path
  (where ompi_comm_free NULLs the field before the destructor runs).
- In ompi_comm_free(), release the ownership reference on c_local_comm
  (OBJ_RELEASE) before calling ompi_comm_free() on it recursively. The
  first release undoes the ownership retain; the recursive free handles
  attribute cleanup and releases the creation reference.
- Simplify the finalize loop in ompi_comm_finalize(): a single
  OBJ_RELEASE per communicator is sufficient.  Sub-communicators retained
  by their parent survive until the parent's destructor cascades the
  release, regardless of CID ordering.
- Remove the OMPI_COMM_EXTRA_RETAIN flag, OMPI_COMM_IS_EXTRA_RETAIN(),
  and OMPI_COMM_SET_EXTRA_RETAIN() from communicator.h.
- Remove the conditional EXTRA_RETAIN block from
  ompi_comm_activate_complete() in comm_cid.c.

For the HAN collective module:
- Replace the HAN_SUBCOM_EXTRA_RETAIN macro (conditional OBJ_RETAIN +
  flag set) with unconditional OBJ_RETAIN on sub-communicators.
- In the HAN module destructor, after ompi_comm_free() on each
  sub-communicator, use ompi_comm_lookup() to check whether the sub-comm
  survived (refcount > 0), and if so, OBJ_RELEASE it. This safely
  handles both the user path (sub-comm survives ompi_comm_free with
  refcount 1, lookup finds it, OBJ_RELEASE destroys it) and the finalize
  path (sub-comm is destroyed by ompi_comm_free, lookup returns NULL).

Signed-off-by: George Bosilca <[email protected]>
The ACOLL module creates many internal sub-communicators (local_comm,
socket_comm, subgrp_comm, numa_comm, leader_comm, socket_ldr_comm,
numa_comm_ldrs, local_r_comm, base_comm[][], split_comm[]) but never
held ownership references on them. During MPI_Finalize, if a sub-comm
has a lower CID than its parent, it could be released before the
module's destructor runs, leading to use-after-free.

Apply the same ownership model introduced for HAN in commit 9336f51cb2:
- OBJ_RETAIN every sub-communicator immediately after creation.
- Add a coll_acoll_subcomm_free() helper that saves the CID, calls
  ompi_comm_free (attribute cleanup + one OBJ_RELEASE), then uses
  ompi_comm_lookup to safely release the ownership reference.
- Use this helper in both cleanup paths: the module destructor in
  coll_acoll_component.c and the re-initialization path in
  coll_acoll_utils.h.
- Fix a pre-existing leak: numa_comm and numa_comm_ldrs were never
  freed in the module destructor; they are now.

Signed-off-by: George Bosilca <[email protected]>
@github-actions github-actions bot added this to the v6.0.0 milestone Apr 13, 2026
@mshanthagit mshanthagit marked this pull request as ready for review April 13, 2026 18:08
@mshanthagit
Copy link
Copy Markdown
Contributor Author

Cherry-pick from the main branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants