ptftests: fail advanced-reboot fast on unreachable neighbors #24359

Open
Ryangwaite wants to merge 1 commit into sonic-net:master from Ryangwaite:early-exit-advanced-reboot-when-no-peer-connectivity

Conversation

@Ryangwaite
Contributor

Description of PR

Summary:
Fixes # (issue)
MSFT ADO: 37767011

When a cEOS/SONiC neighbor is unreachable from the PTF (e.g. broken container networking), advanced-reboot would hang for ~11 minutes and report the misleading failure "DUT hasn't shutdown in 600 seconds".

Root cause: the peer_state_check SSH threads were stuck inside paramiko.connect for its ~130s default timeout, while wait_until_cpu_port_down blocked waiting for their Queue.put('cpu_down'). As a result, the 600s control-plane-down timer expired before any SSH error surfaced.

To avoid this, there is now a preliminary check that the neighbours are reachable over SSH; if they are not, the test aborts early with a more visible failure reason.
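The failure mode described in the root cause can be illustrated with a small, scaled-down simulation. This is only a sketch of the mechanism, not code from advanced-reboot.py: `slow_connect`, `peer_worker`, `wait_for_event` and `simulate` are hypothetical names, and the delays are shortened from minutes to fractions of a second.

```python
import queue
import threading
import time


def slow_connect(delay):
    # Hypothetical stand-in for an SSH connect that hangs, then times out.
    time.sleep(delay)
    raise TimeoutError("Connection timed out")


def peer_worker(events, errors, connect_delay):
    # The 'cpu_down' event is only posted after a successful connect,
    # so a failed connect means the event never arrives.
    try:
        slow_connect(connect_delay)
        events.put("cpu_down")
    except TimeoutError as exc:
        errors.append(repr(exc))


def wait_for_event(events, deadline):
    # Stand-in for the waiter: it only knows that its own deadline expired.
    try:
        return events.get(timeout=deadline)
    except queue.Empty:
        return None


def simulate(connect_delay=0.3, deadline=0.1):
    events = queue.Queue()
    errors = []
    worker = threading.Thread(
        target=peer_worker, args=(events, errors, connect_delay)
    )
    worker.start()
    got = wait_for_event(events, deadline)
    worker.join()
    return got, errors
```

Calling `simulate()` returns `None` for the event, because the waiter gives up on its own deadline, while the real cause (the connect timeout) is only recorded on the worker side. That is the shape of the misleading "hasn't shutdown" report.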

Testing

Without this change

These logs are seen:

2026-05-01 00:28:48 : Error in HostDevice: Traceback (most recent call last):
  File "/root/ptftests/py3/advanced-reboot.py", line 1768, in peer_state_check
    self.fails[ip], self.info[ip], self.cli_info[ip], self.logs_info[ip], self.lacp_pdu_times[ip] = ssh.run()
                                                                                                    ^^^^^^^^^
  File "/root/ptftests/py3/arista.py", line 142, in run
    self.connect()
  File "/root/ptftests/py3/arista.py", line 52, in connect
    self.conn.connect(self.ip, username=self.login,
  File "/root/env-python3/lib/python3.11/site-packages/paramiko/client.py", line 384, in connect
    sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out
...
2026-05-01 00:37:29 : Reachability watcher - checking VLAN GW IP
2026-05-01 00:37:30 : DUT hasn't shutdown in 600 seconds: Traceback (most recent call last):
  File "/root/ptftests/py3/advanced-reboot.py", line 427, in timeout
    res = async_res.get(timeout=seconds)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 770, in get
    raise TimeoutError
multiprocessing.context.TimeoutError

2026-05-01 00:37:30 : ==================================================

Ran 1 test in 678.613s

That 600s timeout is misleading when the true error is that the cEOS neighbours couldn't be reached.

With this change

======================================================================
ERROR: advanced-reboot.ReloadTest
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/ptftests/py3/advanced-reboot.py", line 1475, in runTest
    self._verify_neighbors_reachable()
  File "/root/ptftests/py3/advanced-reboot.py", line 1472, in _verify_neighbors_reachable
    raise RuntimeError(msg)
RuntimeError: Neighbor SSH unreachable from PTF: [('172.16.131.32', 'TimeoutError: timed out'), ('172.16.131.33', 'TimeoutError: timed out'), ('172.16.131.34', 'TimeoutError: timed out'), ('172.16.131.35', 'TimeoutError: timed out')]

----------------------------------------------------------------------
Ran 1 test in 87.427s

FAILED (errors=1)

This saves roughly 10 minutes per failed run (about 87s instead of about 679s) and produces a clearer error for debugging.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Save testbed resources and improve debuggability

How did you do it?

Added a preliminary neighbour check before starting the main part of the test.
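A minimal sketch of what such a preliminary check could look like, assuming a plain TCP probe of the SSH port with a short timeout is enough to detect an unreachable neighbour. The helper names (`probe_ssh`, `find_unreachable_neighbors`, `verify_neighbors_reachable`), the port, and the 5-second timeout are illustrative assumptions, not the code in this PR.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def probe_ssh(ip, port=22, timeout=5.0):
    """Return None if a TCP connection succeeds, else 'ExcName: message'."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return None
    except OSError as exc:
        return "{}: {}".format(type(exc).__name__, exc)


def find_unreachable_neighbors(ips, port=22, timeout=5.0):
    """Probe all neighbours in parallel; return [(ip, error), ...] for failures."""
    ips = list(ips)
    if not ips:
        return []
    # One worker per neighbour, so the total wait is bounded by one timeout.
    with ThreadPoolExecutor(max_workers=len(ips)) as pool:
        results = list(pool.map(lambda ip: probe_ssh(ip, port, timeout), ips))
    return [(ip, err) for ip, err in zip(ips, results) if err is not None]


def verify_neighbors_reachable(ips, port=22, timeout=5.0):
    """Raise early, with the list of failing neighbours, before the test starts."""
    unreachable = find_unreachable_neighbors(ips, port, timeout)
    if unreachable:
        raise RuntimeError(
            "Neighbor SSH unreachable from PTF: {}".format(unreachable)
        )
```

Probing in parallel with an explicit timeout keeps the worst-case cost of the check independent of the number of neighbours, and collecting every failure before raising gives a single error that names all unreachable peers.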

How did you verify/test it?

Happy and sad path testing with the fix in place.

Any platform specific information?

N/A

Supported testbed topology if it's a new test case?

N/A

Documentation

N/A

When a cEOS/SONiC neighbor was unreachable from the PTF (e.g. broken
container networking), advanced-reboot would hang for ~11 minutes and
report the misleading failure "DUT hasn't shutdown in 600 seconds".

Root cause: peer_state_check SSH threads wedged inside paramiko.connect
on its ~130s default timeout while wait_until_cpu_port_down blocked on
their Queue.put('cpu_down'), so the 600s control-plane-down timer
expired before any SSH error surfaced.

Signed-off-by: Ryan Garthwaite <ryangwaite@gmail.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Ryangwaite Ryangwaite requested review from hdwhdw and ryanzhu706 May 1, 2026 04:53
@Ryangwaite Ryangwaite self-assigned this May 1, 2026