Fix incorrect MNNVL fabric check #2626
base: main
Signed-off-by: Nicolas Castet <[email protected]>
a9bd042 to 3e06ef4
@ptrendx Can you trigger CI?
Greptile Overview
Greptile Summary
Fixed the MNNVL (Multi-Node NVLink) fabric detection logic.
Key changes:
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Caller
participant has_mnnvl_fabric
participant CUDA_Driver
participant NVML
Caller->>has_mnnvl_fabric: Check MNNVL fabric support(device_id)
has_mnnvl_fabric->>CUDA_Driver: cuDeviceGet(&dev, device_id)
CUDA_Driver-->>has_mnnvl_fabric: device handle
has_mnnvl_fabric->>CUDA_Driver: cuDeviceGetAttribute(FABRIC_SUPPORTED)
CUDA_Driver-->>has_mnnvl_fabric: fabric_handle_supported
alt fabric_handle_supported
has_mnnvl_fabric->>NVML: nvmlInit_v2()
has_mnnvl_fabric->>NVML: nvmlDeviceGetHandleByIndex_v2(device_id)
NVML-->>has_mnnvl_fabric: local_device
has_mnnvl_fabric->>has_mnnvl_fabric: Initialize fabricInfo
has_mnnvl_fabric->>NVML: nvmlDeviceGetGpuFabricInfoV(local_device, &fabricInfo)
NVML-->>has_mnnvl_fabric: fabricInfo with state and clusterUuid
has_mnnvl_fabric->>NVML: nvmlShutdown()
has_mnnvl_fabric->>has_mnnvl_fabric: Create zero_uuid[NVML_GPU_FABRIC_UUID_LEN]
has_mnnvl_fabric->>has_mnnvl_fabric: Check state == COMPLETED && memcmp(clusterUuid, zero_uuid) != 0
has_mnnvl_fabric-->>Caller: mnnvl_fabric_support
else not supported
has_mnnvl_fabric-->>Caller: false
end
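The final decision step in the diagram above can be sketched in isolation. This is a minimal illustration and not the code from this PR: `FabricInfo`, `FabricState`, `kFabricUuidLen`, `is_zero_uuid`, and `mnnvl_fabric_ready` are hypothetical stand-ins for the NVML fabric info structure and its fields, so that the check compiles without the CUDA driver or NVML headers. The UUID length of 16 is assumed to match `NVML_GPU_FABRIC_UUID_LEN`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for the NVML fabric types (assumption, not the real API).
constexpr std::size_t kFabricUuidLen = 16;  // assumed to mirror NVML_GPU_FABRIC_UUID_LEN

enum class FabricState { NotSupported, NotStarted, InProgress, Completed };

struct FabricInfo {
  unsigned char cluster_uuid[kFabricUuidLen];
  FabricState state;
};

// Returns true when every byte of the cluster UUID is zero.
bool is_zero_uuid(const unsigned char* uuid) {
  assert(uuid != nullptr);
  const unsigned char zero_uuid[kFabricUuidLen] = {};
  return std::memcmp(uuid, zero_uuid, kFabricUuidLen) == 0;
}

// Mirrors the diagram: report fabric support only if the device supports
// fabric handles, fabric registration has completed, and the cluster UUID
// is not all zeros.
bool mnnvl_fabric_ready(bool fabric_handle_supported, const FabricInfo& info) {
  if (!fabric_handle_supported) return false;
  return info.state == FabricState::Completed && !is_zero_uuid(info.cluster_uuid);
}
```

In this sketch, a device whose registration has completed but whose cluster UUID is all zeros is reported as not being on an MNNVL fabric, which is the case the `memcmp` against `zero_uuid` in the diagram guards against.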
No files reviewed, no comments
timmoon10
left a comment
LGTM, pending CI
/te-ci L1
Description
Fix incorrect MNNVL fabric check
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: