Skip to content

Performance Improvements and Bug Fixes#662

Open
mikewiebe wants to merge 62 commits into
developfrom
int_perf_updates
Open

Performance Improvements and Bug Fixes#662
mikewiebe wants to merge 62 commits into
developfrom
int_perf_updates

Conversation

@mikewiebe
Copy link
Copy Markdown
Collaborator

@mikewiebe mikewiebe commented Apr 6, 2026

Summary

This PR delivers a broad set of performance improvements and bug fixes across the NDFC/DCNM collection, with a focus on reducing controller API volume, improving large-scale fabric workflows, and fixing several idempotence and deployment edge cases.

The main improvements introduce bulk API usage where supported, batched polling and delete workflows, cached controller lookups, and O(1) in-memory lookup maps for frequently accessed inventory, interface, VRF, network, link, resource manager, and vPC pair data. These changes significantly reduce repeated API calls and repeated list scans during merged, replaced, overridden, deleted, and query operations.

The PR also improves deployment accuracy for VRF and Network resources by tracking affected switch serials and supporting both switch-level and resource-level deploy modes. Network and VRF delete workflows now use batched readiness checks and bulk delete retry handling, making large-scale cleanup operations more reliable.

Several correctness fixes are included as well, including improved TOR attachment handling for networks, secondary gateway handling for MSD workflows, VRF VLAN and BGP password idempotence fixes, Resource Manager scope comparison fixes, interface native VLAN and defaulting fixes, and safer vPC pair caching/deploy behavior.

Overall, these changes make the modules faster, more scalable, and more predictable when operating against larger fabrics or repeated idempotent runs.

Changes

Module Updates / Performance Improvements Bug Fixes / Behavior Fixes
dcnm_interface - Added bulk interface retrieval using IF_WITH_SNO.
- Added per-switch/per-interface interface detail caching.
- Prefetches interface policy details for all switches referenced in intent before diffing.
- Splits grouped bulk interface responses into individual cache entries.
- Optimized overridden, deleted, and replaced workflows with lookup maps instead of repeated scans.
- Uses bulk interface modify endpoint when supported, including 207 Multi-Status handling.
- Flattens interface deploy payloads to reduce repeated deploy calls.
- Limits safety deploy checks to check_deploy=true cases.
- Fixed merged diff behavior when controller payloads omit optional nvPairs keys.
- Fixed native VLAN-related idempotence gaps.
- Fixed stale/default handling for port-channel and vPC member interfaces after parent operations.
- Added TOR uplink awareness so TOR uplink ports are not incorrectly defaulted during overridden/default operations.
dcnm_inventory - Added O(1) inventory lookup by management IP.
- Added cached POAP inventory lookup by serial number.
- Reuses existing inventory data for known switches and skips unnecessary reachability checks for idempotent runs.
- Groups create/import payloads by common role/configuration to reduce duplicate discover/import calls.
- Optimized rediscovery, health polling, credential matching, and role assignment using dictionaries/sets.
- Batches NDFC role assignments into a single roles API call.
- Reduced polling windows for all-switch readiness and config-save retries.
- Improved idempotent bootstrap/preprovision handling.
- Added fast-path handling for merged reruns with no new switches while still honoring save/deploy flows.
- Updated diff formatting to reflect grouped switch create/import payloads.
dcnm_links - Added bulk API support detection.
- Added bulk link update endpoint support.
- Sends all link modifications through a single bulk PUT when supported.
- Preserves existing per-link update behavior when bulk APIs are unavailable.
- Accepts both 200 OK and 207 Multi-Status for bulk link updates.
- Added clearer failure handling when the source fabric cannot be found.
- Minor unit test expectation cleanup.
dcnm_network - Added deploy_mode support for switch-level versus resource-level deploy flows.
- Added bulk network update support using /top-down/v2/bulk-update/networks where available.
- Added switch-level deploy helpers and serial-to-network deploy mapping.
- Added optimized network inventory retrieval using targeted GETs for small requests and bulk GETs for large/overridden requests.
- Added O(1) attachment lookup by network name.
- Added batched attachment readiness checks, detach/deploy handling, and network delete polling.
- Added bulk network delete with retry and partial-failure handling.
- Optimized attachment/deploy payload generation using serial attach/detach maps.
- Fixed secondary gateway handling for MSD parent workflows.
- Fixed merged TOR port handling so new TOR ports are appended without dropping existing TOR attachments.
- Fixed replaced/overridden TOR behavior so TOR ports are removed only when required.
- Improved parsing for mixed leaf/TOR portNames, residual ports, and TOR-only attachment strings.
- Added normalization for one-sided TOR intent on vPC peers.
- Fixed config-only deployment detection so network configuration changes deploy even without attachment deltas.
- Improved named-network not-found handling across NDFC and OneManage response variants.
- Removed direct /tmp/dcnm.log debug writes from the action plugin.
dcnm_resource_manager - Added bulk API support detection.
- Added bulk resource creation endpoint support.
- Added bulk payload builder for Fabric, Device, DeviceInterface, DevicePair, and Link scopes.
- Uses bulk resource create when supported and falls back to individual create otherwise.
- Accepts both 200 OK and 207 Multi-Status for bulk create/delete operations.
- Fixed resource manager idempotence by making entityType / scopeType comparisons case-insensitive.
- Preserves optional resourceValue and vrfName in generated bulk payloads.
dcnm_vpc_pair - Added VPC pair info caching by switch DB ID.
- Caches pair info under both peer DB IDs when available.
- Added sync status caching for repeated deploy checks.
- Returns deep copies from cache to prevent mutation side effects.
- Reuses cached pair info across delete/query/override flows.
- Reduces unnecessary recommendation, peer-link, and sync-status API calls.
- Skips peer-link/recommendation calls when a switch has no peer.
- Invalidates sync cache around deploy/save retry points so retry checks use fresh controller state.
- Refined deploy decision logic so create/merge paths go directly to deploy payload while no-change existing pairs use sync checks.
dcnm_vrf - Added deploy_mode support for switch-level versus resource-level deploy flows.
- Added bulk VRF update support using /top-down/v2/bulk-update/vrfs where available.
- Added switch-level deploy helpers and serial-to-VRF deploy mapping.
- Added O(1) lookup maps for VRF create, attach, and desired state.
- Optimized VRF-lite data retrieval by batching GET_VRF_SWITCH by VRF and serials.
- Filters VRF processing to playbook-targeted VRFs before attachment/switch lookups.
- Added batched VRF attachment delete readiness checks.
- Added bulk VRF delete with retry and partial-failure handling.
- Fixed VRF VLAN propagation when enableL3VniNoVlan is enabled or desired VLAN is 0.
- Fixed BGP password clear handling by skipping key-type comparison when desired password is empty.
- Fixed no-attach deploy accumulation to avoid duplicating VRFs already in deploy diffs.
- Allows OUT-OF-SYNC, FAILED, and NA terminal states during delete readiness checks.
- Rebuilds lookup maps during failure/rollback paths to keep retry state consistent.
- Action plugin now validates wrapped deploy payloads and calls shared parent deploy logic.

Test Coverage

Module Coverage Added / Updated
dcnm_interface
  • Added regression coverage for merged Ethernet interfaces when native VLAN fields are missing from controller data.
  • Updated unit fixtures to exercise bulk IF_WITH_SNO prefetch/cache behavior across Ethernet, SVI, loopback, port-channel, vPC, multi-interface, bunched-interface, and overridden flows.
  • Added grouped bulk response coverage to verify that multiple interfaces returned under one policy are split into individual cache entries.
dcnm_inventory
  • Added route-based mock handling to validate optimized API call patterns.
  • Updated merge/import tests for grouped switch create flows and batched role assignment.
  • Added coverage for the already-created switch save/deploy fast path.
  • Updated POAP, preprovision, multi-switch, brownfield/greenfield, RMA, save, and deploy test flows to match cached and batched inventory behavior.
dcnm_links
  • Corrected the unnumbered fabric fixture reference used by existing link tests.
  • Existing link merge, modify, replace, delete, query, numbered, unnumbered, IPv6, vPC, and XE integration coverage remains in place.
dcnm_network
  • Added unit coverage for secondary gateway template generation, add/update behavior, clear behavior, and ambiguous merged clear rejection.
  • Added MSD split coverage to verify secondary gateways stay parent-only for merged and are sent to children for replaced workflows.
  • Added unit coverage for dynamic delete deploy serial selection, OUT-OF-SYNC bulk delete, batched attachment delete readiness checks, and unexpected delete readiness data.
  • Added bulk inventory coverage for replace workflows.
  • Added TOR attachment coverage for TOR-only updates, vPC-to-TOR pair updates, one-sided vPC TOR intent normalization, single TOR peer updates, single-leaf/single-TOR updates, and replace/idempotence scenarios.
  • Added standalone integration test cases TC13-TC15 for TOR port preservation and idempotence across vPC leaf pair to single TOR, vPC leaf pair to TOR pair, and single leaf to single TOR topologies.
dcnm_resource_manager
  • Existing Resource Manager merge, delete, query, invalid-parameter, and sanity coverage remains in place.
  • No dedicated test file was changed for the new bulk create path in this PR.
dcnm_vpc_pair
  • Updated unit expectations to match reduced API calls from VPC pair info caching and sync-status caching.
  • Adjusted create, delete, replaced, and overridden test flows so they validate the optimized call sequence while preserving existing behavior checks.
dcnm_vrf
  • Added fixture coverage for batched VRF-lite GET_VRF_SWITCH responses across multiple switches.
  • Updated merged, replaced, overridden, deleted, query, MSD, and MCFG unit flows to validate batched VRF-lite lookups.
  • Added coverage for delete readiness polling, NA terminal VRF delete state, bulk delete retry behavior, and delete failure reporting.
  • Updated MSD/MCFG delete and override tests to include undeploy, readiness polling, and bulk delete behavior.

@mikewiebe mikewiebe changed the title Int perf updates Performance Improvements and Bug Fixes Apr 6, 2026
mikewiebe and others added 23 commits April 6, 2026 14:18
  1. Only fetch VPC pair serials for vpc or aa_fex interfaces.
  2. Add lookup tables for replaced, similar to what overridden already has.
replaced still does linear scans of self.have and self.pb_input for each interface
  1. Fix dcnm_interface unit tests based on performance refactoring
  2. dcnm_intf_compare_want_and_have(), the nvPairs update handling was indented outside the for nk in nv_keys loop, so only the last nvPair comparison could trigger action = "update"
Changes
1. Targeted State Fetch in get_have()
What changed: get_have() previously fetched all VRFs on the fabric and their attachments regardless of playbook scope. It now restricts the attachment query to only the VRFs named in the playbook when state is merged, replaced, or deleted.

Why: A fabric with 500 VRFs but a playbook targeting 5 VRFs was fetching attachments for all 500. The targeted path adds an early-return when all wanted VRFs are new (nothing to fetch at all). States overridden and query preserve full-fabric semantics as required.

Improvement: O(n) API data reduction proportional to the ratio of fabric VRFs to playbook VRFs.

2. O(1) Lookup Dictionaries for Diff Computation
What changed: Added have_create_by_name, have_attach_by_name, and want_create_by_name dictionaries (keyed by vrfName) built once after get_have() and get_want() complete. Replaced all inner for-loop scans and find_dict_in_list_by_key_value(search=self.have_*/self.want_create, ...) lookups in diff_merge_create(), diff_merge_attach(), get_diff_delete(), get_diff_replace(), and get_diff_override().

Why: Every diff method was doing O(n) linear scans through have_create/have_attach/want_create for each item in an outer loop, resulting in overall O(n²) complexity. With 100 VRFs this is 10,000 comparisons instead of 100.

Improvement: All diff-phase VRF lookups drop from O(n²) to O(n).

3. Bulk Attachment Fetch in get_diff_query()
What changed: get_diff_query() previously made one GET /vrfs/attachments API call per VRF — N calls for N VRFs. Both the config-present and no-config branches now make a single bulk call with all VRF names comma-joined, then index the results by vrfName for O(1) access in the loop.

Why: The GET_VRF_ATTACH endpoint already supports comma-separated VRF names. The per-VRF loop was an unnecessary fan-out of API calls. This mirrors the pattern already used in get_have().

Improvement: N API calls reduced to 1 per query run.

4. Early-Exit Fast Path in Delete-Readiness Polling
What changed: wait_for_vrf_del_ready() previously entered a polling while-loop for every VRF unconditionally. An initial single-pass poll now checks each VRF's state before entering the retry loop, and immediately skips VRFs already in terminal states (NA, OUT-OF-SYNC, FAILED).

Why: In the common case where VRFs are already detached and ready, every VRF was still making an initial API call inside the retry loop with sleep delays. The quick-scan pass avoids the loop overhead entirely for those VRFs.

Improvement: Eliminates unnecessary retry-loop overhead for VRFs already in terminal states.

5. Deduplication of VRF Deploy Lists
What changed: Applied sorted(set(...)) before joining all_vrfs into the vrfNames undeploy string in get_diff_delete() and get_diff_override().

Why: Defensive guard against duplicate VRF names in the undeploy payload, which could cause controller-side errors or redundant operations in edge cases.

Improvement: Prevents potential duplicate undeploy requests.

Testing:
All 63 existing unit tests pass without modification.
No changes to module interface, argument spec, or external behavior.
1. Hybrid Bulk Threshold for get_have()
Constant added: BULK_GET_HAVE_VRF_THRESHOLD = 25 at module level.

New API paths (GET_VRF_TARGETED) added to both v11 and v12 path dicts:

"/rest/top-down/fabrics/{}/vrfs?vrf-names={}"
get_have() restructured: for targeted states (merged/replaced/deleted) with explicit config:

< 25 VRFs in config: calls GET_VRF_TARGETED directly via dcnm_send, fetching only the named VRFs — avoids downloading the full fabric VRF payload.
≥ 25 VRFs in config: falls back to get_vrf_objects() (bulk GET_VRF) + in-memory filter — one bulk call is cheaper than many individual calls for large configs.
Non-targeted states (overridden/query/deleted-without-config): full-fabric sweep unchanged.

2. Batch VRF-Lite Switch Lookups
All three call sites where get_vrf_lite_objects() was called per-switch now use a pre-fetch batch per VRF instead:

get_have() outer loop: before processing the per-switch inner loop for a given VRF, collects all serial numbers for that VRF and makes one GET_VRF_SWITCH call with all serial numbers comma-joined. Results are indexed by serialNumber in lite_by_sn, used for O(1) lookup in the inner loop.

get_diff_query() config branch: before the per-switch inner loop for each want_c, makes one batch call with all switches of that VRF. Results in lite_by_sn dict; per-switch lite_attach is reconstructed from the indexed batch response.

get_diff_query() no-config branch: same pattern — one batch call per fabric VRF with all its switches.

Test updates: Added two combined fixture entries (mock_vrf_attach_get_ext_object_combined_sn1_sn2 and mock_vrf_attach_get_ext_object_combined_sn1_sn4) to the JSON fixtures, and updated the 11 test cases that exercise VRF-Lite playbook configs to use one combined mock instead of two sequential mocks
…call reduction)

- Split deletion readiness checks into separate attachment and object status methods
- Remove duplicated quick-scan logic in VRF deletion flow
- Use bulk GET APIs for checking all pending resources in single call per retry
- Reduces deletion polling from N×M API calls to M calls (N=resources, M=retries)
mikewiebe and others added 25 commits April 27, 2026 23:01
…& Action Plugin + Log Module Addition for Networks.

Co-authored-by: Copilot <[email protected]>
…ze Pending Deploy in Networks

Co-authored-by: Copilot <[email protected]>
Performance Bulk Updates + Handling
Enable actions on int_perf_updates

Remove before merge to develop
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants