Skip to content

Conversation

@nvcastet
Copy link
Contributor

Description

Fix incorrect MNNVL fabric check

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Nicolas Castet <[email protected]>
@nvcastet
Copy link
Contributor Author

@ptrendx Can you trigger CI?

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 27, 2026

Greptile Overview

Greptile Summary

Fixed the MNNVL (Multi-Node NVLink) fabric detection logic in the has_mnnvl_fabric function. The previous implementation incorrectly treated the clusterUuid byte array as a null-terminated C-string by only checking the first byte, which could lead to false positives. The fix properly compares the entire UUID byte array using memcmp against a zero-initialized array.

Key changes:

  • Removed manual null-terminator initialization of fabricInfo.clusterUuid[0]
  • Changed state check from >= to == for NVML_GPU_FABRIC_STATE_COMPLETED (more precise matching)
  • Replaced single-byte check with proper memcmp comparison across the entire NVML_GPU_FABRIC_UUID_LEN array

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it fixes a clear bug in UUID comparison logic
  • The change corrects a clear logic error where a byte array was being treated as a C-string. The fix is straightforward, well-scoped, and follows proper byte array comparison practices. No breaking changes or side effects expected.
  • No files require special attention

Important Files Changed

Filename Overview
transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp Fixed MNNVL fabric UUID check to properly compare byte array instead of treating it as C-string, and corrected state comparison to exact match

Sequence Diagram

sequenceDiagram
    participant Caller
    participant has_mnnvl_fabric
    participant CUDA_Driver
    participant NVML

    Caller->>has_mnnvl_fabric: Check MNNVL fabric support(device_id)
    has_mnnvl_fabric->>CUDA_Driver: cuDeviceGet(&dev, device_id)
    CUDA_Driver-->>has_mnnvl_fabric: device handle
    has_mnnvl_fabric->>CUDA_Driver: cuDeviceGetAttribute(FABRIC_SUPPORTED)
    CUDA_Driver-->>has_mnnvl_fabric: fabric_handle_supported
    
    alt fabric_handle_supported
        has_mnnvl_fabric->>NVML: nvmlInit_v2()
        has_mnnvl_fabric->>NVML: nvmlDeviceGetHandleByIndex_v2(device_id)
        NVML-->>has_mnnvl_fabric: local_device
        has_mnnvl_fabric->>has_mnnvl_fabric: Initialize fabricInfo
        has_mnnvl_fabric->>NVML: nvmlDeviceGetGpuFabricInfoV(local_device, &fabricInfo)
        NVML-->>has_mnnvl_fabric: fabricInfo with state and clusterUuid
        has_mnnvl_fabric->>NVML: nvmlShutdown()
        has_mnnvl_fabric->>has_mnnvl_fabric: Create zero_uuid[NVML_GPU_FABRIC_UUID_LEN]
        has_mnnvl_fabric->>has_mnnvl_fabric: Check state == COMPLETED && memcmp(clusterUuid, zero_uuid) != 0
        has_mnnvl_fabric-->>Caller: mnnvl_fabric_support
    else not supported
        has_mnnvl_fabric-->>Caller: false
    end
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@timmoon10 timmoon10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending CI

@timmoon10
Copy link
Collaborator

/te-ci L1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants