UCP/EP/FT: handle the case when no more lanes to fallback #11351
UCP/EP/FT: handle the case when no more lanes to fallback #11351evgeny-leksikov wants to merge 4 commits intoopenucx:masterfrom
Conversation
- recursive request restart from zcopy completion - am header leak - ep_config ref counter fix for ifaces reactivation in lanes discard flow
There was a problem hiding this comment.
Pull request overview
Handles failover edge-cases when an endpoint runs out of fallback lanes, aiming to prevent recursive restarts, fix AM header handling during failures, and correct worker interface reactivation/refcount behavior during lane discard/reactivation flows.
Changes:
- Gate protocol restart on zcopy completion when the endpoint is already in
FAILEDstate, and improve restart/reset assertions for easier debugging. - Extend
ucp_ep_set_cfg_index()with areactivateflag and adjust wireup/reconfig/discard flows to avoid incorrect iface (de)activation/refcounting. - Update AM eager multi-zcopy to cope with “no AM lane available” scenarios and adjust FT tests to cover exhaustion behavior.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/gtest/ucp/test_ucp_fault_tolerance.cc | Updates AM failure-injection logic/tests for “no more lanes to fallback” scenarios and log/error expectations. |
| src/ucp/wireup/wireup_cm.c | Adapts to new ucp_ep_set_cfg_index(..., reactivate) signature. |
| src/ucp/wireup/wireup.c | Adapts to new ucp_ep_set_cfg_index(..., reactivate) signature. |
| src/ucp/proto/proto_common.inl | Prevents zcopy restart when EP is already failed. |
| src/ucp/proto/proto_common.c | Improves assertion diagnostics when proto reset fails during restart. |
| src/ucp/core/ucp_ep.h | Updates ucp_ep_set_cfg_index() API to include reactivate parameter. |
| src/ucp/core/ucp_ep.c | Implements conditional iface reactivation in ucp_ep_set_cfg_index() and adjusts discard/reconfig flows. |
| src/ucp/am/eager_multi.c | Adds “no AM lane” handling in eager multi-zcopy init/reset/completion. |
| src/ucp/am/eager.inl | Adds tracing for AM header descriptor allocation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /** | ||
| * At least 1 non-failed AM lane must be available. | ||
| */ | ||
| if (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) { |
There was a problem hiding this comment.
can we avoid extra branch in data path?
| send.state.uct_comp); | ||
|
|
||
| if (ucs_likely(self->status == UCS_OK) || | ||
| (req->send.ep->flags & UCP_EP_FLAG_FAILED)) { |
There was a problem hiding this comment.
why needed, do we set flag FAILED in case of failover?
There was a problem hiding this comment.
Only when all lanes are failed
There was a problem hiding this comment.
Needed to avoid leak of descriptor with copy of user AM header
| return status; | ||
| } | ||
|
|
||
| return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ? |
There was a problem hiding this comment.
what does it mean if ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE?
There was a problem hiding this comment.
It means that the are no live AM lanes after reconfiguration.
I converted such places to assertion since proto selection logic should not lead to this situation.
| UCS_TEST_MESSAGE << "Keep 1 live lane for UD transports since " | ||
| << "local error injection on all lanes leads to " | ||
| "failed assertion in ud_ep_purge"; | ||
| num_lanes_to_fail--; |
There was a problem hiding this comment.
Can num_lanes_to_fail be 0 or 1, before the subtraction?
There was a problem hiding this comment.
No, shuffle_lanes at very begin skips the test if num_lanes < 2
| short_progress_loop(); | ||
| size_t min_expected_err_count = 1; | ||
| if ((failure_side == FAILURE_SIDE_TARGET) && | ||
| has_transport("dc_x")) { |
There was a problem hiding this comment.
Please document DC special case
| status = ucp_ep_reconfig_internal(ucp_ep, failed_lanes); | ||
| if (status != UCS_OK) { | ||
| ucs_assertv(ucp_ep->cfg_index == old_cfg_index, | ||
| "ep %p: cfg_index %u -> %u after reconfiguration error %s", ucp_ep, old_cfg_index, |
| return UCS_ERR_NO_MEMORY; | ||
| } | ||
|
|
||
| ucp_trace_req(req, "allocating AM header descriptor 0x%p", reg_desc); |
| add_variant_with_value(variants, UCP_FEATURE_AM, TEST_OP_AM, | ||
| op_name(TEST_OP_AM)); | ||
| add_variant_with_value(variants, UCP_FEATURE_AM | UCP_FEATURE_RMA, TEST_OP_AM | TEST_OP_FLUSH, | ||
| add_variant_with_value(variants, UCP_FEATURE_AM, TEST_OP_AM | TEST_OP_FLUSH, |
There was a problem hiding this comment.
UCP_FEATURE_RMA was added by mistaken earlier for AM test. Do you want to open separate PR for this trivial fix?
| } | ||
|
|
||
| return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ? | ||
| UCS_ERR_UNREACHABLE : UCS_OK; |
There was a problem hiding this comment.
The func returns OK if ucp_am_proto_request_zcopy_reset is OK, AND no AM lanes?
Needs at least commenting..
| size_t min_expected_err_count = 1; | ||
| if ((failure_side == FAILURE_SIDE_TARGET) && | ||
| has_transport("dc_x")) { | ||
| min_expected_err_count = 0; |
There was a problem hiding this comment.
Can skip the test, the next assert is always true.
| } | ||
|
|
||
| ucs_status_t do_am_send_and_wait(ucp_ep_h ep, size_t size, bool flush_after) { | ||
|
|
What?
Handle the case when no more lanes to fallback, fixes:
Why?
to fix various failures like hang, segfaults and assertions