[RCCL] Fix netOverride being skipped when rail trees are
enabled (restore desired NIC mapping for targeted 4-NIC systems) (#4726)
## Motivation
This PR resolves a performance issue where enabling rail-optimized trees
caused an early return, skipping the netOverride logic on affected
systems.
On some 4-NIC MI3xx systems, this manifested as:
- enabling rail trees appearing to regress ring performance
- inconsistent behavior depending on whether rail trees were selected
On these systems, `netOverride` is required to pair GPUs that have no PXB
path to a NIC with the desired NIC.
## Technical Details
1. Remove the early return after treeRail success
2. Refactor the override into a helper, `applyNetOverride(system,
romeTopoModels[i].options);`
3. Apply the override after graph construction
4. The override now runs regardless of which ring/tree models are matched
5. Add a runtime control, `RCCL_DISABLE_NET_OVERRIDE`, for debugging issues
caused by netOverride's hard assumptions about system layout
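The control-flow change in the list above can be sketched as a toy model. This is not the RCCL source: `ToySystem`, `ToyModel`, `buildGraphsBefore`, `buildGraphsAfter`, and `netOverrideDisabled` are hypothetical stand-ins, the body of `applyNetOverride` is a placeholder, and both the parsing of `RCCL_DISABLE_NET_OVERRIDE` (any non-empty value other than `0` disables the override) and the placement of that check inside the helper are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for the topology system; not an RCCL type.
struct ToySystem {
  std::string graphKind;            // which graph was built: "rail" or "default"
  bool netOverrideApplied = false;  // set once the override has run
};

// Hypothetical stand-in for one romeTopoModels[i] entry.
struct ToyModel {
  bool railTreeMatches;  // true when the rail-optimized tree is selected
  const char* options;   // stand-in for romeTopoModels[i].options
};

// Assumed semantics: any non-empty value other than "0" disables the override.
bool netOverrideDisabled() {
  const char* value = std::getenv("RCCL_DISABLE_NET_OVERRIDE");
  return value != nullptr && std::string(value) != "" && std::string(value) != "0";
}

// Placeholder helper: the real one would remap GPUs to NICs based on options.
void applyNetOverride(ToySystem& system, const char* options) {
  assert(options != nullptr);
  if (netOverrideDisabled()) return;  // debugging escape hatch (assumed placement)
  system.netOverrideApplied = true;
}

// Before: the early return on rail-tree success skips the override.
void buildGraphsBefore(ToySystem& system, const ToyModel& model) {
  if (model.railTreeMatches) {
    system.graphKind = "rail";
    return;  // override never runs on this path
  }
  system.graphKind = "default";
  applyNetOverride(system, model.options);
}

// After: graphs are built first, then the override runs on every path.
void buildGraphsAfter(ToySystem& system, const ToyModel& model) {
  system.graphKind = model.railTreeMatches ? "rail" : "default";
  applyNetOverride(system, model.options);
}
```

In this sketch, `buildGraphsBefore` leaves `netOverrideApplied` false whenever the rail tree matches, while `buildGraphsAfter` sets it on both paths unless the environment variable disables it.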
## JIRA ID
ROCM-3200
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[rocm-systems] ROCm/rocm-systems#4726 (commit f66e1c0)