ptftests: fail advanced-reboot fast on unreachable neighbors#24359
Open
Ryangwaite wants to merge 1 commit into sonic-net:master from
Conversation
When a cEOS/SONiC neighbor was unreachable from the PTF (e.g. broken
container networking), advanced-reboot would hang for ~11 minutes and
report the misleading failure "DUT hasn't shutdown in 600 seconds".
Root cause: peer_state_check SSH threads wedged inside paramiko.connect
on its ~130s default timeout while wait_until_cpu_port_down blocked on
their Queue.put('cpu_down'), so the 600s control-plane-down timer
expired before any SSH error surfaced.
Signed-off-by: Ryan Garthwaite <ryangwaite@gmail.com>
Collaborator: /azp run
Azure Pipelines successfully started running 1 pipeline(s).
hdwhdw approved these changes on May 1, 2026
Description of PR
Summary:
Fixes # (issue)
MSFT ADO: 37767011
When a cEOS/SONiC neighbor is unreachable from the PTF (e.g. broken container networking), advanced-reboot would hang for ~11 minutes and report the misleading failure "DUT hasn't shutdown in 600 seconds".
Root cause: peer_state_check SSH threads wedged inside paramiko.connect on its ~130s default timeout while wait_until_cpu_port_down blocked on their Queue.put('cpu_down'), so the 600s control-plane-down timer expired before any SSH error surfaced.
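The failure mode described above can be modelled in a few lines. This is a scaled-down sketch, not the advanced-reboot code: `peer_check`, `wait_for_cpu_down`, and `simulate` are hypothetical names, the stalled `paramiko.connect` is replaced by a `time.sleep`, and the direction of the queue handoff is simplified. It only shows why the outer timer's message is the one that surfaces when a worker thread is stuck connecting.

```python
import queue
import threading
import time


def peer_check(events, connect_delay):
    # Stand-in for an SSH connect that stalls on an unreachable neighbor.
    time.sleep(connect_delay)
    # The handoff only happens once the connect attempt has returned.
    events.put("cpu_down")


def wait_for_cpu_down(events, timeout):
    # The waiter only learns that nothing arrived in time, not why.
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        return "DUT hasn't shutdown in %s seconds" % timeout


def simulate(connect_delay, timeout):
    events = queue.Queue()
    worker = threading.Thread(
        target=peer_check, args=(events, connect_delay), daemon=True
    )
    worker.start()
    return wait_for_cpu_down(events, timeout)
```

When `connect_delay` exceeds `timeout`, `simulate` returns the "hasn't shutdown" text even though the real problem is the stalled connect, which is the misleading behaviour this PR addresses.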
To avoid this, there is now a preliminary check that the neighbours are reachable over SSH. If any are not, the test aborts early with a clearer failure reason.
Testing
Without this change
These logs are seen:
That 600s timeout is misleading when the true error is that the cEOS neighbours couldn't be reached.
With this change
This saves ~10 minutes per failed run and gives a clearer error for debugging.
Type of change
Back port request
Approach
What is the motivation for this PR?
Save testbed resources and improve debuggability
How did you do it?
Added a preliminary neighbour check before starting the main part of the test.
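A minimal sketch of what such a preliminary check could look like, assuming a plain TCP probe of the SSH port is an acceptable stand-in for a full SSH login. This is not the code in this PR: `unreachable_neighbors` and `assert_neighbors_reachable` are hypothetical helpers, and the port and timeout defaults are illustrative.

```python
import socket


def unreachable_neighbors(hosts, port=22, timeout=5.0):
    """Return {host: error} for every host whose SSH port cannot be reached."""
    failures = {}
    for host in hosts:
        try:
            # A short, explicit timeout keeps one dead neighbor from stalling the run.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError as exc:
            # Covers refused connections, timeouts, and name resolution errors.
            failures[host] = str(exc)
    return failures


def assert_neighbors_reachable(hosts, port=22, timeout=5.0):
    # Fail before the main test starts, naming every unreachable neighbor.
    failures = unreachable_neighbors(hosts, port, timeout)
    if failures:
        details = ", ".join("%s (%s)" % item for item in sorted(failures.items()))
        raise RuntimeError("Neighbors unreachable over SSH: " + details)
```

Running the probe once up front, with its own short timeout, means an unreachable neighbour is reported by name instead of being hidden behind the later control-plane-down timer.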
How did you verify/test it?
Happy and sad path testing with the fix in place.
Any platform specific information?
N/A
Supported testbed topology if it's a new test case?
N/A
Documentation
N/A