Skip to content

UCP/EP/FT: handle the case when no more lanes to fallback #11351

Open
evgeny-leksikov wants to merge 4 commits intoopenucx:masterfrom
evgeny-leksikov:ucp_ft_all_lanes_failure
Open

UCP/EP/FT: handle the case when no more lanes to fallback #11351
evgeny-leksikov wants to merge 4 commits intoopenucx:masterfrom
evgeny-leksikov:ucp_ft_all_lanes_failure

Conversation

@evgeny-leksikov
Copy link
Copy Markdown
Contributor

What?

Handle the case when no more lanes to fallback, fixes:

  • recursive request restart from zcopy completion
  • am header leak
  • ep_config ref counter fix for ifaces reactivation in lanes discard flow

Why?

to fix various failures like hang, segfaults and assertions

- recursive request restart from zcopy completion
- am header leak
- ep_config ref counter fix for ifaces reactivation in lanes discard flow
@evgeny-leksikov evgeny-leksikov changed the title Ucp ft all lanes failure UCP/EP/FT: handle the case when no more lanes to fallback Apr 16, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Handles failover edge-cases when an endpoint runs out of fallback lanes, aiming to prevent recursive restarts, fix AM header handling during failures, and correct worker interface reactivation/refcount behavior during lane discard/reactivation flows.

Changes:

  • Gate protocol restart on zcopy completion when the endpoint is already in FAILED state, and improve restart/reset assertions for easier debugging.
  • Extend ucp_ep_set_cfg_index() with a reactivate flag and adjust wireup/reconfig/discard flows to avoid incorrect iface (de)activation/refcounting.
  • Update AM eager multi-zcopy to cope with “no AM lane available” scenarios and adjust FT tests to cover exhaustion behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/gtest/ucp/test_ucp_fault_tolerance.cc Updates AM failure-injection logic/tests for “no more lanes to fallback” scenarios and log/error expectations.
src/ucp/wireup/wireup_cm.c Adapts to new ucp_ep_set_cfg_index(..., reactivate) signature.
src/ucp/wireup/wireup.c Adapts to new ucp_ep_set_cfg_index(..., reactivate) signature.
src/ucp/proto/proto_common.inl Prevents zcopy restart when EP is already failed.
src/ucp/proto/proto_common.c Improves assertion diagnostics when proto reset fails during restart.
src/ucp/core/ucp_ep.h Updates ucp_ep_set_cfg_index() API to include reactivate parameter.
src/ucp/core/ucp_ep.c Implements conditional iface reactivation in ucp_ep_set_cfg_index() and adjusts discard/reconfig flows.
src/ucp/am/eager_multi.c Adds “no AM lane” handling in eager multi-zcopy init/reset/completion.
src/ucp/am/eager.inl Adds tracing for AM header descriptor allocation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ucp/am/eager_multi.c Outdated
Comment thread test/gtest/ucp/test_ucp_fault_tolerance.cc
Comment thread src/ucp/core/ucp_ep.h
Comment thread src/ucp/am/eager_multi.c Outdated
/**
* At least 1 non-failed AM lane must be available.
*/
if (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid extra branch in data path?

Comment thread src/ucp/am/eager_multi.c
send.state.uct_comp);

if (ucs_likely(self->status == UCS_OK) ||
(req->send.ep->flags & UCP_EP_FLAG_FAILED)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed, do we set flag FAILED in case of failover?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only when all lanes are failed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to avoid leak of descriptor with copy of user AM header

Comment thread src/ucp/am/eager_multi.c Outdated
return status;
}

return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does it mean if ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that the are no live AM lanes after reconfiguration.
I converted such places to assertion since proto selection logic should not lead to this situation.

UCS_TEST_MESSAGE << "Keep 1 live lane for UD transports since "
<< "local error injection on all lanes leads to "
"failed assertion in ud_ep_purge";
num_lanes_to_fail--;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can num_lanes_to_fail be 0 or 1, before the subtraction?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, shuffle_lanes at very begin skips the test if num_lanes < 2

short_progress_loop();
size_t min_expected_err_count = 1;
if ((failure_side == FAILURE_SIDE_TARGET) &&
has_transport("dc_x")) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document DC special case

Comment thread src/ucp/core/ucp_ep.c Outdated
status = ucp_ep_reconfig_internal(ucp_ep, failed_lanes);
if (status != UCS_OK) {
ucs_assertv(ucp_ep->cfg_index == old_cfg_index,
"ep %p: cfg_index %u -> %u after reconfiguration error %s", ucp_ep, old_cfg_index,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: long line

Comment thread src/ucp/am/eager.inl Outdated
return UCS_ERR_NO_MEMORY;
}

ucp_trace_req(req, "allocating AM header descriptor 0x%p", reg_desc);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: 0x is redundant.

add_variant_with_value(variants, UCP_FEATURE_AM, TEST_OP_AM,
op_name(TEST_OP_AM));
add_variant_with_value(variants, UCP_FEATURE_AM | UCP_FEATURE_RMA, TEST_OP_AM | TEST_OP_FLUSH,
add_variant_with_value(variants, UCP_FEATURE_AM, TEST_OP_AM | TEST_OP_FLUSH,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UCP_FEATURE_RMA was added by mistaken earlier for AM test. Do you want to open separate PR for this trivial fix?

Comment thread src/ucp/am/eager_multi.c Outdated
}

return (ucp_ep_get_am_lane(req->send.ep) == UCP_NULL_LANE) ?
UCS_ERR_UNREACHABLE : UCS_OK;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The func returns OK if ucp_am_proto_request_zcopy_reset is OK, AND no AM lanes?
Needs at least commenting..

size_t min_expected_err_count = 1;
if ((failure_side == FAILURE_SIDE_TARGET) &&
has_transport("dc_x")) {
min_expected_err_count = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can skip the test, the next assert is always true.

}

ucs_status_t do_am_send_and_wait(ucp_ep_h ep, size_t size, bool flush_after) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: remove

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants