
UCT/CUDA/COPY: Qualify cuda_copy for RNDV with peer-failure EH #11319

Open: pentschev wants to merge 3 commits into openucx:master from pentschev:intra-process-cuda-copy-fix

pentschev (Contributor) commented Apr 1, 2026

What?

Relax peer-failure and RMD invalidation wireup requirements when an endpoint targets the same UCP worker as the local side (same unpacked-address UUID), including memtype endpoints used for intra-worker GPU copies.

Changes:

  1. select.c / ucp_wireup_fill_peer_err_criteria(): If unpacked_addr->uuid == worker->uuid, do not add UCT_IFACE_FLAG_ERRHANDLE_PEER_FAILURE to the local iface criteria. Pass the worker and unpacked address through ucp_wireup_fill_aux_criteria(), ucp_wireup_select_wireup_msg_lane(), and ucp_wireup_select_aux_transport() so that auxiliary and wireup-message lane selection follows the same rule.
  2. select.c / ucp_wireup_add_rma_bw_lanes(): Add UCT_MD_FLAG_INVALIDATE_RMA to RMA BW MD criteria only when the remote address is not the same worker (select_params->address->uuid != ep->worker->uuid), in addition to the existing peer-failure + RNDV checks.
  3. ucp_request.c / ucp_request_get_invalidation_map(): If UCP_EP_CONFIG_KEY_FLAG_SELF is set on the EP config key, return an empty invalidation map. This lets same-worker RMA BW lanes use MDs that lack UCT_MD_FLAG_INVALIDATE_RMA (e.g. cuda_copy) without tripping asserts during peer-failure teardown.
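To make the rule in items 1 and 2 concrete, here is a minimal standalone sketch of the same-worker check and how it gates the two capability requirements. All `sketch_*` types, functions, and flag values are hypothetical stand-ins, not the UCX API; the real code works on ucp_worker_h, ucp_unpacked_address_t, and the UCT_IFACE_FLAG_* / UCT_MD_FLAG_* bits from the UCX headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for UCX flag bits; real values differ. */
#define SKETCH_IFACE_FLAG_ERRHANDLE_PEER_FAILURE (1ull << 0)
#define SKETCH_MD_FLAG_INVALIDATE_RMA            (1ull << 1)
#define SKETCH_EP_INIT_ERR_MODE_PEER_FAILURE     (1u << 0)

/* Hypothetical stand-ins: only the UUID field matters for this rule. */
typedef struct { uint64_t uuid; } sketch_worker_t;
typedef struct { uint64_t uuid; } sketch_unpacked_address_t;

/* Same-worker test: the unpacked remote address carries our own UUID. */
static int sketch_is_same_worker(const sketch_worker_t *worker,
                                 const sketch_unpacked_address_t *addr)
{
    assert(worker != NULL);
    return (addr != NULL) && (addr->uuid == worker->uuid);
}

/* Item 1: require peer-failure support on the local iface only when
 * peer-failure error handling is requested AND the peer is another worker. */
static void sketch_fill_peer_err_criteria(uint64_t *local_iface_flags,
                                          unsigned ep_init_flags,
                                          const sketch_worker_t *worker,
                                          const sketch_unpacked_address_t *addr)
{
    if (!(ep_init_flags & SKETCH_EP_INIT_ERR_MODE_PEER_FAILURE) ||
        sketch_is_same_worker(worker, addr)) {
        return;
    }
    *local_iface_flags |= SKETCH_IFACE_FLAG_ERRHANDLE_PEER_FAILURE;
}

/* Item 2: require RMA invalidation on the RMA BW MD only for
 * peer-failure + RNDV toward a different worker. */
static uint64_t sketch_rma_bw_md_flags(unsigned ep_init_flags, int is_rndv,
                                       const sketch_worker_t *worker,
                                       const sketch_unpacked_address_t *addr)
{
    if ((ep_init_flags & SKETCH_EP_INIT_ERR_MODE_PEER_FAILURE) && is_rndv &&
        !sketch_is_same_worker(worker, addr)) {
        return SKETCH_MD_FLAG_INVALIDATE_RMA;
    }
    return 0;
}
```

Under this sketch, an MD that lacks the invalidate bit (as cuda_copy does) still passes the RMA BW filter for a same-worker EP, while cross-worker EPs keep both requirements.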

Why?

Previously, peer-failure + RNDV wireup required iface peer-failure caps and MD RMA invalidation for every EP, including connections where the "remote" is the same worker (loopback, memtype EPs for host/device staging). cuda_copy does not expose those UCT capabilities, so it was dropped from RMA BW selection and users saw cuda_ipc or worse paths for intra-worker CUDA work, even though there is no separate remote worker whose failure must invalidate cross-node RMA keys.

For same-worker EPs, UCP_EP_CONFIG_KEY_FLAG_SELF already marks the configuration; skipping strict UCT requirements in wireup and skipping MD invalidation in ucp_request_get_invalidation_map() matches the real semantics: no independent remote peer for that connection, so the stricter cross-peer invalidation contract is unnecessary. Cross-worker endpoints are unchanged and still require the full iface + MD behavior.
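The invalidation-map side of this argument can be sketched as below. This is an illustration under stated assumptions, not the real ucp_request_get_invalidation_map(): the `sketch_*` names are hypothetical, and the non-self branch simply returns every RMA BW MD as a placeholder because the PR does not describe the existing computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: a bitmap of memory domains and an EP config key
 * carrying a SELF flag. Values are illustrative only. */
typedef uint64_t sketch_md_map_t;
#define SKETCH_EP_CONFIG_KEY_FLAG_SELF (1u << 0)

typedef struct {
    unsigned        flags;          /* SKETCH_EP_CONFIG_KEY_FLAG_* bits */
    sketch_md_map_t rma_bw_md_map;  /* MDs used by the RMA BW lanes */
} sketch_ep_config_key_t;

/* A same-worker EP has no independent peer whose failure must revoke RMA
 * access, so nothing needs invalidating and the map is empty. For other
 * EPs this placeholder returns every RMA BW MD. */
static sketch_md_map_t
sketch_get_invalidation_map(const sketch_ep_config_key_t *key)
{
    assert(key != NULL);
    if (key->flags & SKETCH_EP_CONFIG_KEY_FLAG_SELF) {
        return 0;
    }
    return key->rma_bw_md_map;
}
```

Because the self case returns before any MD is inspected, an MD without invalidate support on a self EP never reaches the invalidation path.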

Closes #11318

Advertise UCT_IFACE_FLAG_ERRHANDLE_PEER_FAILURE so
ucp_wireup_fill_peer_err_criteria does not exclude cuda_copy from
RMA BW lanes.

Advertise UCT_MD_FLAG_INVALIDATE and UCT_MD_FLAG_INVALIDATE_RMA and
handle UCT_MD_MEM_DEREG_FLAG_INVALIDATE in mem dereg (invoke
completion), matching ucp_wireup_add_rma_bw_lanes and
ucp_memh_dereg expectations.
gleon99 requested review from iyastreb and shasson5 on April 4, 2026 13:41
Comment on lines +107 to +109
/* UCT_IFACE_FLAG_ERRHANDLE_PEER_FAILURE required for RMA BW wireup
* (ucp_wireup_fill_peer_err_criteria) when error handling is requested.
* Transfers are local copies; UCP handles invalidation when a peer fails. */
Contributor


We should change the UCP layer to not require peer-failure or invalidate support when connecting an endpoint to the same worker, including memtype endpoints, instead of changing UCT.
@shasson5 can help if needed

pentschev (Contributor, Author)

Thanks @yosefe @shasson5. I've reverted the UCT changes and pushed a fix in the UCP layer. Could you please review?

When an endpoint is wired to the same UCP worker (same unpacked address
UUID), there is no independent remote peer for cross-worker RMA. Skip
requiring UCT_IFACE_FLAG_ERRHANDLE_PEER_FAILURE and
UCT_MD_FLAG_INVALIDATE_RMA in RMA BW wireup criteria in that case. In
ucp_request_get_invalidation_map(), return an empty invalidation map for
UCP_EP_CONFIG_KEY_FLAG_SELF so RMA BW lanes without MD invalidate
support (e.g. cuda_copy) remain valid for same-worker / memtype EPs with
peer-failure error handling. Extend ucp_wireup_fill_peer_err_criteria()
and ucp_wireup_fill_aux_criteria() with worker + unpacked address so
auxiliary and other lane selection paths apply the same rule.
pentschev force-pushed the intra-process-cuda-copy-fix branch from eec6960 to 4fd4eb4 on April 9, 2026 12:54
yosefe (Contributor) left a comment

The test failure seems relevant.

Review thread on src/ucp/wireup/select.c
Comment on lines 1049 to +1055
if (ep_init_flags & UCP_EP_INIT_ERR_MODE_PEER_FAILURE) {
/* No independent remote worker when connecting an EP to itself (loopback,
* memtype EPs, etc.): peer-failure iface caps and MD invalidation are
* relaxed in wireup; see ucp_request_get_invalidation_map(). */
if ((unpacked_addr != NULL) && (unpacked_addr->uuid == worker->uuid)) {
return;
}
Contributor


IMO it would be better to update ucp_ep_err_mode_init_flags() to not set UCP_EP_INIT_ERR_MODE_PEER_FAILURE in the uuid == remote_uuid case.
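A rough sketch of the reviewer's alternative, assuming the init-flags helper can see both UUIDs: the same-worker decision moves to EP initialization, so the wireup selection paths never see the peer-failure flag for a self EP. The function name, enum, and extra UUID parameters below are hypothetical; the real ucp_ep_err_mode_init_flags() signature is not shown in this thread and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for ucp_err_handling_mode_t and the EP init flag. */
typedef enum {
    SKETCH_ERR_HANDLING_MODE_NONE,
    SKETCH_ERR_HANDLING_MODE_PEER
} sketch_err_mode_t;

#define SKETCH_EP_INIT_ERR_MODE_PEER_FAILURE (1u << 0)

/* Suggested variant: a same-worker EP gets no peer-failure init flag, so
 * downstream wireup code needs no separate UUID checks. */
static unsigned sketch_err_mode_init_flags(sketch_err_mode_t err_mode,
                                           uint64_t local_uuid,
                                           uint64_t remote_uuid)
{
    if ((err_mode != SKETCH_ERR_HANDLING_MODE_PEER) ||
        (local_uuid == remote_uuid)) {
        return 0;
    }
    return SKETCH_EP_INIT_ERR_MODE_PEER_FAILURE;
}
```

The design trade-off in this sketch is one check at a single entry point instead of parallel checks in ucp_wireup_fill_peer_err_criteria(), ucp_wireup_add_rma_bw_lanes(), and the auxiliary selection paths.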

Linked issue (may be closed by merging this pull request): cuda_copy excluded from RMA BW lanes when peer-failure error handling is enabled