muxorch: fix shared-MAC neighbor recovery in updateFdb#4507
muxorch: fix shared-MAC neighbor recovery in updateFdb#4507Ndancejic wants to merge 1 commit intosonic-net:masterfrom
Conversation
When two mux neighbors share a MAC, the boolean found_existing_mux_neighbor flag gated off the fallback recovery path globally whenever loop (A) re-bound any one of them, stranding the other as a plain NeighOrch entry after FDB age-out. Replace with std::set<NeighborEntry> handled to track specific IPs loop (A) processed, and always run the fallback with a skip-if-handled filter. Adds tests/test_mux.py::test_fdb_aging_after_neighbor_shared_mac that reproduces the bug on master. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@manamand2020 for review + visibility |
There was a problem hiding this comment.
Pull request overview
Fixes a recovery gap in MuxOrch::updateFdb() when multiple neighbors share the same MAC on a mux port (e.g., IPv4 + IPv6 on the same NIC), ensuring “stranded” NeighOrch neighbors are properly converted back into mux-managed neighbors when the FDB entry is re-learned.
Changes:
- Replace function-scoped boolean gating with per-neighbor tracking (
std::set<NeighborEntry> handled) to avoid skipping recovery for other same-MAC neighbors. - Make the fallback recovery path run for mux-port FDB learns and skip neighbors already handled by the first loop.
- Add an end-to-end regression test covering the shared-MAC + FDB aging scenario.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
orchagent/muxorch.cpp |
Adjusts updateFdb() recovery logic to track handled neighbors and attempt conversion of stranded same-MAC neighbors. |
tests/test_mux.py |
Adds a regression test reproducing shared-MAC neighbor recovery failure and validating the fix. |
| const NeighborEntry& neighbor_entry = neighbor_pair.first; | ||
| const auto& neighbor_data = neighbor_pair.second; | ||
|
|
||
| // Already re-bound by the loop above — skip. | ||
| if (handled.count(neighbor_entry)) |
There was a problem hiding this comment.
In this fallback scan, getNeighborTable() returns a const reference, but the current code assigns it to a local auto neighbors (just above this loop), which makes a full copy of the neighbor table. Please bind it as a const& (and ideally outside the VLAN loop) to avoid repeated O(N) copies/allocations on every FDB learn event.
| if (isMuxExists(update.entry.port_name)) | ||
| { | ||
| // Check if there's an existing neighbor with this MAC on any VLAN interface | ||
| // that could be converted to a MUX neighbor | ||
| auto vlan_ports = gPortsOrch->getAllVlans(); |
There was a problem hiding this comment.
The fallback block now runs on every FDB add for a mux port (if (isMuxExists(...))), which can significantly increase work compared to the prior gating, since it scans VLANs and neighbors. Consider reducing the cost by iterating the neighbor table once (instead of nested VLAN × neighbor loops) and/or adding a cheap early-exit condition so frequent FDB refresh/learn events on mux ports don’t repeatedly do a full scan when there are no stranded neighbors to convert.
What I did
Fixes a shared-MAC neighbor recovery bug in
MuxOrch::updateFdb()introduced by #4491.When two mux neighbors (e.g. standby ToR + SoC behind the same server) share a MAC, the function-scoped
bool found_existing_mux_neighborwas set by any same-MAC hit in the loop overmux_nexthop_tb_. That gated off the fallback recovery block globally, so whenever the first loop re-bound one of the neighbors, the other — sitting as a plain NeighOrch entry after FDB age-out — stayed stranded and never got promoted back intomux_nexthop_tb_.Replace the boolean with
std::set<NeighborEntry> handledthat tracks the specific IPs the first loop already re-bound. The fallback now always runs when the FDB port is a mux, usinghandledas a skip filter to avoid re-processing.Why
Reproduces like this on master:
updateFdbfires, loop (A) re-binds A, sets the flag, fallback skippedAdded
tests/test_mux.py::test_fdb_aging_after_neighbor_shared_macwhich reproduces this end-to-end and asserts both A and B are MUX-managed with tunnel routes.Notes for reviewers
for (auto vlan_alias : vlan_ports)loop in the fallback block is technically redundant —getMuxPort()already rejects non-VLAN aliases, so the explicit alias filter andgetAllVlans()iteration aren't needed. I left it alone to keep this PR focused on the bugfix; happy to do a follow-up cleanup if reviewers agree.How I tested
test_fdb_aging_after_neighbor_shared_mac— new, fails on master, passes with fix