Open
Labels
bug (Something isn't working), needs-follow-up (Issue needs follow-up)
Description
Problem
During a DGXC benchmark run on AWS H100, NCCL fails to load the OFI network plugin and falls back to Socket transport (with terrible performance as a result):
NCCL INFO NET/Plugin: libnccl-net-ofi.so: /opt/rdma-core/build/lib/libibverbs.so.1: version `IBVERBS_PRIVATE_34' not found (required by /usr/lib/x86_64-linux-gnu/libefa.so.1)
NCCL INFO NET/Plugin: Could not find: ofi
This appears to be a version conflict between the two copies of the rdma-core libraries installed in the container:
- System (/usr/lib/x86_64-linux-gnu/): libibverbs 1.14.56.0 exporting IBVERBS_PRIVATE_34, installed via .deb from the networking image. ibverbs-providers (libefa, libmlx5, etc.) are built against this version.
- HPC-X (/opt/rdma-core/build/lib/): libibverbs 1.15.60.0 exporting IBVERBS_PRIVATE_59, from HPC-X's bundled rdma-core source build.
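The HPC-X copy (exporting IBVERBS_PRIVATE_59) gets loaded during job init, so the system libefa.so.1, which requires IBVERBS_PRIVATE_34, cannot resolve its symbols and the OFI plugin fails to load.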
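To confirm which copy of libibverbs a process has actually mapped, one option is to read /proc/self/maps after the libraries are loaded. The following is a minimal Linux-only sketch; the helper names `mapped_libraries` and `current_process_libraries` are hypothetical and not part of NCCL or rdma-core.

```python
import os


def mapped_libraries(maps_text, needle="libibverbs"):
    """Return the sorted, de-duplicated file paths in a /proc/<pid>/maps dump
    whose path contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Each line is: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def current_process_libraries(needle="libibverbs"):
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return mapped_libraries(f.read(), needle)


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    # Prints an empty list unless a matching library is already loaded,
    # for example after the ctypes.CDLL calls from the repro.
    print(current_process_libraries())
```

If the only path reported is under /opt/rdma-core/build/lib/, the HPC-X copy is the one in use, which would be consistent with the error above.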
Minimal repro
1. Run any DGXC benchmark recipe on an AWS system with NCCL_DEBUG=INFO.
2. Alternatively, try loading the libraries directly in the container:
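import ctypes, os
# Load HPC-X's libibverbs first with global symbol visibility, mirroring what happens during job init.
ctypes.CDLL('/opt/rdma-core/build/lib/libibverbs.so.1', mode=os.RTLD_NOW | os.RTLD_GLOBAL)
# Loading the OFI plugin afterwards is expected to raise OSError with the IBVERBS_PRIVATE_34 message.
ctypes.CDLL('libnccl-net-ofi.so', mode=os.RTLD_NOW)
Expected behavior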
I expect it to correctly load the EFA libraries on an EFA system.
Affected area
area:build
Regression?
Yes
Environment
- NeMo Container: 26.02.00
- DGXC Benchmarks 26.02 (pre-release)
- AWS H100 cluster
Logs
pool0-1521:2460198:2460198 [3] NCCL INFO NCCL_NET_PLUGIN set by environment to ofi
pool0-1521:2460198:2460198 [3] NCCL INFO NET/Plugin: libnccl-net-ofi.so: /opt/rdma-core/build/lib/libibverbs.so.1: version `IBVERBS_PRIVATE_34' not found (required by /usr/lib/x86_64-linux-gnu/libefa.so.1)
pool0-1521:2460198:2460198 [3] NCCL INFO NET/Plugin: Could not find: ofi
pool0-1521:2460189:2460189 [0] NCCL INFO NET/IB : No device found.
pool0-1521:2460189:2460189 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.12.183.174<0>
pool0-1521:2460189:2460189 [0] NCCL INFO Failed to initialize NET plugin IB
pool0-1521:2460189:2460189 [0] NCCL INFO NET/Socket : Using [0]eth0:10.12.183.174<0>
pool0-1521:2460189:2460189 [0] NCCL INFO Initialized NET plugin Socket
pool0-1521:2460189:2460189 [0] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
pool0-1521:2460189:2460189 [0] NCCL INFO Assigned NET plugin Socket to comm
pool0-1521:2460189:2460189 [0] NCCL INFO Using network Socket
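A fallback like this is easy to miss in a long NCCL_DEBUG=INFO log. As a rough sketch, the script below scans log text for the three message shapes seen above: the loader's version error, "Could not find", and "Using network". The function name `summarize_nccl_net` and the patterns are assumptions derived only from the lines in this issue; other NCCL versions may word these messages differently.

```python
import re

# Patterns modelled on the log lines quoted in this issue (assumed, not exhaustive).
USING_NETWORK = re.compile(r"NCCL INFO Using network (\S+)")
PLUGIN_MISSING = re.compile(r"NCCL INFO NET/Plugin: Could not find: (\S+)")
VERSION_ERROR = re.compile(
    r"(?P<loaded>/\S+): version `(?P<tag>[^']+)' not found "
    r"\(required by\s+(?P<dependent>[^)]+)\)"
)


def summarize_nccl_net(log_text):
    """Summarize which network NCCL selected and why a plugin failed to load."""
    networks = sorted(set(USING_NETWORK.findall(log_text)))
    missing = sorted(set(PLUGIN_MISSING.findall(log_text)))
    conflicts = sorted(
        {(m["loaded"], m["tag"], m["dependent"].strip())
         for m in VERSION_ERROR.finditer(log_text)}
    )
    return {
        "networks": networks,
        "missing_plugins": missing,
        "version_conflicts": conflicts,
        # Flag the case from this issue: a plugin was not found and Socket was used.
        "socket_fallback": "Socket" in networks and bool(missing),
    }


sample = "\n".join([
    "pool0-1521:2460198:2460198 [3] NCCL INFO NET/Plugin: libnccl-net-ofi.so: "
    "/opt/rdma-core/build/lib/libibverbs.so.1: version `IBVERBS_PRIVATE_34' "
    "not found (required by /usr/lib/x86_64-linux-gnu/libefa.so.1)",
    "pool0-1521:2460198:2460198 [3] NCCL INFO NET/Plugin: Could not find: ofi",
    "pool0-1521:2460189:2460189 [0] NCCL INFO Using network Socket",
])
print(summarize_nccl_net(sample))
```

Running this over the per-rank logs of a benchmark job would make the Socket fallback visible without reading the full output.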