Bug: [Warm-reboot] lag_keepalive may send LACPDUs more than 60s after retry-count update #26758

@YairRaviv

Is it platform specific

generic

Importance or Severity

Critical

Description of the bug

During warm-reboot:

  1. teamd_increase_retry-count.py raises the LAG retry count from 3 to 5 (a 150 s timeout instead of 90 s) and notifies the SONiC peer(s), which then record a partner retry count of 5.
  2. lag_keepalive.py sends LACPDUs every 1 second to all LAG peers; each of these resets the peer's partner retry count back to 3.
  3. 0016-block-retry-count-changes.patch prevents the peer from resetting the retry count back to 3 for 60 seconds after it receives a packet from teamd_increase_retry-count.py.
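
The timing in the three steps above can be sketched as follows. This is only an illustrative model, not code from teamd or the patch: the 30 s period is an assumption inferred from the 90 s and 150 s figures, and the function names are hypothetical. Whether the 60 s boundary itself is inclusive is also an assumption.

```python
# Assumption: 30 s per retry, inferred from "3 -> 90 s" and "5 -> 150 s" above.
LACP_PERIOD_S = 30
# Block window taken from the description of 0016-block-retry-count-changes.patch.
BLOCK_WINDOW_S = 60


def lag_timeout_s(retry_count: int) -> int:
    """Seconds without LACPDUs before the peer declares the LAG member down."""
    return retry_count * LACP_PERIOD_S


def keepalive_resets_retry_count(seconds_since_notification: float) -> bool:
    """True if a lag_keepalive LACPDU arriving now is no longer blocked.

    Hypothetical helper: models the patch as a plain 60 s window that
    starts at the last retry-count notification.
    """
    return seconds_since_notification > BLOCK_WINDOW_S


print(lag_timeout_s(3), lag_timeout_s(5))
print(keepalive_resets_retry_count(47), keepalive_resets_retry_count(62))
```

Under this model a keepalive that arrives 62 s after the notification is accepted as a reset, which shortens the peer's timeout from 150 s back to 90 s.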

The failure occurs in the following scenario:

  1. Warm-reboot shutdown is still in progress, and lag_keepalive sends LACPDUs more than 60 seconds after the last retry-count notification was sent to a peer, so the peer resets its partner retry count back to 3.
  2. Control-plane downtime exceeds 90 seconds (as happens on some systems and is described in this issue).
  3. The LAGs flap.
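
The scenario above can be written as a single condition. The sketch below is a simplified model under stated assumptions (30 s per retry, a strict 60 s block window, and a hypothetical function name); it is not taken from teamd or the warm-reboot scripts.

```python
def lag_flaps(last_notification_s: float,
              last_keepalive_s: float,
              control_plane_downtime_s: float) -> bool:
    """Model of the failure: does the peer time out the LAG?

    All arguments are seconds on a common clock. The 60 s window and the
    30 s-per-retry period are assumptions drawn from the description.
    """
    # Step 1: a keepalive sent after the block window resets the count to 3.
    reset_to_default = (last_keepalive_s - last_notification_s) > 60
    retry_count = 3 if reset_to_default else 5
    # Steps 2 and 3: the LAG flaps if downtime outlasts the resulting timeout.
    return control_plane_downtime_s > retry_count * 30


# Keepalive 62 s after the notification, 100 s of downtime: flap.
print(lag_flaps(0, 62, 100))
# Keepalive stops within the window, same downtime: retry count 5 holds.
print(lag_flaps(0, 47, 100))
```

The model shows why both conditions are needed: with the retry count kept at 5, a downtime of 100 s stays within the 150 s timeout.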

Steps to Reproduce

Run a warm reboot with LAGs connected to SONiC peers.

Actual Behavior and Expected Behavior

Actual behavior - warm reboot test failure:
2026-04-12 12:34:25 : FAILED:<ip>:LAG flapped 1 times on 10.213.80.125 after warm boot

Expected behavior -
The extended retry count should not be reverted, and the LAG timeout should remain 150 seconds for the duration of the warm reboot.

Relevant log output

First LACPDU:
2026 Apr 12 12:27:14.390114 sonic INFO lag_keepalive: ready to send LACPDU packets via dict_keys(['Ethernet88', 'Ethernet100', 'Ethernet24', 'Ethernet28', 'Ethernet96', 'Ethernet92', 'Ethernet36', 'Ethernet32'])
2026 Apr 12 12:27:14.404324 sonic INFO lag_keepalive: sent LACPDU packets via dict_keys(['Ethernet88', 'Ethernet100', 'Ethernet24', 'Ethernet28', 'Ethernet96', 'Ethernet92', 'Ethernet36', 'Ethernet32'])

First LAG retry-count notification:
Apr 12 12:27:52.853087 ARISTA05T1 DEBUG teamd#teamd_PortChannel1[24]: Ethernet1: LACPDU version changed from 1 to 241
Apr 12 12:27:53.072572 ARISTA05T1 DEBUG teamd#teamd_PortChannel1[24]: Ethernet1: ignoring resetting retry count to 3
Apr 12 12:27:53.072748 ARISTA05T1 DEBUG teamd#teamd_PortChannel1[24]: Ethernet1: LACPDU version changed from 241 to 1
Apr 12 12:27:53.073757 ARISTA05T1 DEBUG teamd#teamd_PortChannel1[24]: Ethernet1: ignoring resetting retry count to 3


Last LAG retry-count notification:
Apr 12 12:28:07.720858 ARISTA06T1 DEBUG teamd#teamd_PortChannel1[23]: Ethernet1: LACPDU version changed from 1 to 241
Apr 12 12:28:08.145569 ARISTA06T1 DEBUG teamd#teamd_PortChannel1[23]: Ethernet1: ignoring resetting retry count to 3
Apr 12 12:28:08.145569 ARISTA06T1 DEBUG teamd#teamd_PortChannel1[23]: Ethernet1: LACPDU version changed from 241 to 1

Last LACPDU:
2026 Apr 12 12:28:55.077712 sonic INFO lag_keepalive: sent LACPDU packets via dict_keys(['Ethernet88', 'Ethernet100', 'Ethernet24', 'Ethernet28', 'Ethernet96', 'Ethernet92', 'Ethernet36', 'Ethernet32'])
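
The gap between each peer's retry-count notification and the last LACPDU can be computed from the timestamps above. This is a rough sketch that assumes the device clocks are synchronized, that the excerpts show the last notification-related line each peer logged, and that the peer logs are from 2026 like the lag_keepalive lines (the peer lines omit the year).

```python
from datetime import datetime

YEAR = "2026"          # assumption: peer log lines omit the year
BLOCK_WINDOW_S = 60    # window described for 0016-block-retry-count-changes.patch


def parse(stamp: str) -> datetime:
    """Parse a syslog-style timestamp such as 'Apr 12 12:28:55.077712'."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S.%f")


last_lacpdu = parse("Apr 12 12:28:55.077712")
last_notification = {
    "ARISTA05T1": parse("Apr 12 12:27:53.073757"),
    "ARISTA06T1": parse("Apr 12 12:28:08.145569"),
}

# Seconds between each peer's last logged notification and the last keepalive.
gaps = {peer: (last_lacpdu - seen).total_seconds()
        for peer, seen in last_notification.items()}

for peer, gap in gaps.items():
    print(f"{peer}: {gap:.3f} s, outside block window: {gap > BLOCK_WINDOW_S}")
```

With these timestamps the gap is about 62 s for ARISTA05T1 and about 47 s for ARISTA06T1, so under the stated assumptions only ARISTA05T1 falls outside the 60 s window.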

Output of show version, show techsupport

Attach files (if any)

No response
