Skip to content

muxorch: fix shared-MAC neighbor recovery in updateFdb#4507

Open
Ndancejic wants to merge 1 commit intosonic-net:masterfrom
Ndancejic:fix/muxorch-shared-mac-aging
Open

muxorch: fix shared-MAC neighbor recovery in updateFdb#4507
Ndancejic wants to merge 1 commit intosonic-net:masterfrom
Ndancejic:fix/muxorch-shared-mac-aging

Conversation

@Ndancejic
Copy link
Copy Markdown
Contributor

@Ndancejic Ndancejic commented Apr 22, 2026

What I did

Fixes a shared-MAC neighbor recovery bug in MuxOrch::updateFdb() introduced by #4491.

When two mux neighbors (e.g. standby ToR + SoC behind the same server) share a MAC, the function-scoped bool found_existing_mux_neighbor was set by any same-MAC hit in the loop over mux_nexthop_tb_. That gated off the fallback recovery block globally, so whenever the first loop re-bound one of the neighbors, the other — sitting as a plain NeighOrch entry after FDB age-out — stayed stranded and never got promoted back into mux_nexthop_tb_.

Replace the boolean with std::set<NeighborEntry> handled that tracks the specific IPs the first loop already re-bound. The fallback now always runs when the FDB port is a mux, using handled as a skip filter to avoid re-processing.

Why

Reproduces like this on master:

  1. FDB primed → neighbor A added, gets MUX-managed
  2. FDB ages out (del_fdb)
  3. Neighbor B added while FDB is absent → stranded as plain NeighOrch entry
  4. FDB re-learned → updateFdb fires, loop (A) re-binds A, sets the flag, fallback skipped
  5. B stays stranded: no tunnel route, no ASIC neighbor entry

Added tests/test_mux.py::test_fdb_aging_after_neighbor_shared_mac which reproduces this end-to-end and asserts both A and B are MUX-managed with tunnel routes.

Notes for reviewers

  • The outer for (auto vlan_alias : vlan_ports) loop in the fallback block is technically redundant — getMuxPort() already rejects non-VLAN aliases, so the explicit alias filter and getAllVlans() iteration aren't needed. I left it alone to keep this PR focused on the bugfix; happy to do a follow-up cleanup if reviewers agree.
  • No change to the first loop's logic — only the gating bool was replaced.

How I tested

  • test_fdb_aging_after_neighbor_shared_mac — new, fails on master, passes with fix
  • Existing mux tests pass locally

When two mux neighbors share a MAC, the boolean found_existing_mux_neighbor
flag gated off the fallback recovery path globally whenever loop (A)
re-bound any one of them, stranding the other as a plain NeighOrch entry
after FDB age-out. Replace with std::set<NeighborEntry> handled to track
specific IPs loop (A) processed, and always run the fallback with a
skip-if-handled filter.

Adds tests/test_mux.py::test_fdb_aging_after_neighbor_shared_mac
that reproduces the bug on master.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
Copilot AI review requested due to automatic review settings April 22, 2026 23:06
@Ndancejic Ndancejic requested a review from prsunny as a code owner April 22, 2026 23:06
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@Ndancejic
Copy link
Copy Markdown
Contributor Author

@manamand2020 for review + visibility

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a recovery gap in MuxOrch::updateFdb() when multiple neighbors share the same MAC on a mux port (e.g., IPv4 + IPv6 on the same NIC), ensuring “stranded” NeighOrch neighbors are properly converted back into mux-managed neighbors when the FDB entry is re-learned.

Changes:

  • Replace function-scoped boolean gating with per-neighbor tracking (std::set<NeighborEntry> handled) to avoid skipping recovery for other same-MAC neighbors.
  • Make the fallback recovery path run for mux-port FDB learns and skip neighbors already handled by the first loop.
  • Add an end-to-end regression test covering the shared-MAC + FDB aging scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
orchagent/muxorch.cpp Adjusts updateFdb() recovery logic to track handled neighbors and attempt conversion of stranded same-MAC neighbors.
tests/test_mux.py Adds a regression test reproducing shared-MAC neighbor recovery failure and validating the fix.

Comment thread orchagent/muxorch.cpp
Comment on lines 1874 to +1878
const NeighborEntry& neighbor_entry = neighbor_pair.first;
const auto& neighbor_data = neighbor_pair.second;

// Already re-bound by the loop above — skip.
if (handled.count(neighbor_entry))
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this fallback scan, getNeighborTable() returns a const reference, but the current code assigns it to a local auto neighbors (just above this loop), which makes a full copy of the neighbor table. Please bind it as a const& (and ideally outside the VLAN loop) to avoid repeated O(N) copies/allocations on every FDB learn event.

Copilot uses AI. Check for mistakes.
Comment thread orchagent/muxorch.cpp
Comment on lines +1860 to 1864
if (isMuxExists(update.entry.port_name))
{
// Check if there's an existing neighbor with this MAC on any VLAN interface
// that could be converted to a MUX neighbor
auto vlan_ports = gPortsOrch->getAllVlans();
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback block now runs on every FDB add for a mux port (if (isMuxExists(...))), which can significantly increase work compared to the prior gating, since it scans VLANs and neighbors. Consider reducing the cost by iterating the neighbor table once (instead of nested VLAN × neighbor loops) and/or adding a cheap early-exit condition so frequent FDB refresh/learn events on mux ports don’t repeatedly do a full scan when there are no stranded neighbors to convert.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants