[bug] EFA/OFI plugin fails to load in NeMo 26.02.00 container #2824

@sudostock

Description

Problem

During a DGXC benchmark run on AWS H100, NCCL fails to load the OFI network plugin and falls back to the Socket transport, which severely degrades performance:

  NCCL INFO NET/Plugin: libnccl-net-ofi.so: /opt/rdma-core/build/lib/libibverbs.so.1: version `IBVERBS_PRIVATE_34' not found (required by /usr/lib/x86_64-linux-gnu/libefa.so.1)
  NCCL INFO NET/Plugin: Could not find: ofi

This appears to be a version conflict between the two rdma-core builds installed in the container:

  • System (/usr/lib/x86_64-linux-gnu/): libibverbs 1.14.56.0 exporting IBVERBS_PRIVATE_34, installed via .deb from the networking image. ibverbs-providers (libefa, libmlx5, etc.) are built against this version.
  • HPC-X (/opt/rdma-core/build/lib/): libibverbs 1.15.60.0 exporting IBVERBS_PRIVATE_59, from HPC-X's bundled rdma-core source build.

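The HPC-X copy (exporting IBVERBS_PRIVATE_59) gets loaded during job init. The system libefa.so.1 requires IBVERBS_PRIVATE_34, which that copy does not provide, so the plugin fails to load.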
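To confirm which copy of libibverbs a process has actually mapped, one option is to inspect /proc/<pid>/maps. The following is a minimal sketch, not part of the original report; the helper name `mapped_libraries` is hypothetical, and it assumes a Linux /proc filesystem.

```python
from pathlib import Path


def mapped_libraries(fragment, maps_text=None):
    """Return sorted, de-duplicated file paths from a /proc/<pid>/maps dump
    whose path contains `fragment` (hypothetical helper for illustration)."""
    if maps_text is None:
        # Assumption: running on Linux, inspecting the current process.
        maps_text = Path("/proc/self/maps").read_text()
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6 and fragment in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


if __name__ == "__main__":
    for path in mapped_libraries("libibverbs"):
        print(path)
```

Run inside the training process (or after the preload step from the repro), a path under /opt/rdma-core/build/lib/ would indicate that the HPC-X build was picked up rather than the system one.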

Minimal repro

1. Run any DGXC benchmark recipe on an AWS system with NCCL_DEBUG=INFO.
2. Alternatively, try loading the libraries directly in the container:

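import ctypes, os
# Preload the HPC-X libibverbs with global symbol visibility, mirroring what gets loaded during job init.
ctypes.CDLL('/opt/rdma-core/build/lib/libibverbs.so.1', mode=os.RTLD_NOW | os.RTLD_GLOBAL)
# Loading the OFI plugin should then raise OSError with the IBVERBS_PRIVATE_34 message.
ctypes.CDLL('libnccl-net-ofi.so', mode=os.RTLD_NOW)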
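For scripted checks, the same load attempt can be wrapped so that the loader error is returned as a string instead of raised. This is only a sketch under stated assumptions: `try_load` is a hypothetical helper, and it uses the ctypes mode constants rather than the `os.RTLD_*` flags from the repro.

```python
import ctypes


def try_load(path, make_global=False):
    """Attempt to dlopen `path`; return (True, None) on success or
    (False, error message) on failure (hypothetical helper)."""
    mode = ctypes.RTLD_GLOBAL if make_global else ctypes.DEFAULT_MODE
    try:
        ctypes.CDLL(path, mode=mode)
    except OSError as exc:
        # ctypes surfaces the dynamic loader's message in the exception text.
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    # Paths taken from the report; they only exist inside the affected container.
    print(try_load("/opt/rdma-core/build/lib/libibverbs.so.1", make_global=True))
    print(try_load("libnccl-net-ofi.so"))
```

In the affected container, the second call would be expected to return the same "version `IBVERBS_PRIVATE_34' not found" text that NCCL logs.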

Expected behavior

On an EFA system, NCCL should load the OFI plugin and the EFA libraries instead of falling back to the Socket transport.

Affected area

area:build

Regression?

Yes

Environment

  • NeMo Container: 26.02.00
  • DGXC Benchmarks 26.02 (pre-release)
  • AWS H100 cluster

Logs

pool0-1521:2460198:2460198 [3] NCCL INFO NCCL_NET_PLUGIN set by environment to ofi
pool0-1521:2460198:2460198 [3] NCCL INFO NET/Plugin: libnccl-net-ofi.so: /opt/rdma-core/build/lib/libibverbs.so.1: version `IBVERBS_PRIVATE_34' not found (required by /usr/lib/x86_64-linux-gnu/libefa.so.1)
pool0-1521:2460198:2460198 [3] NCCL INFO NET/Plugin: Could not find: ofi
pool0-1521:2460189:2460189 [0] NCCL INFO NET/IB : No device found.
pool0-1521:2460189:2460189 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:10.12.183.174<0>
pool0-1521:2460189:2460189 [0] NCCL INFO Failed to initialize NET plugin IB
pool0-1521:2460189:2460189 [0] NCCL INFO NET/Socket : Using [0]eth0:10.12.183.174<0>
pool0-1521:2460189:2460189 [0] NCCL INFO Initialized NET plugin Socket
pool0-1521:2460189:2460189 [0] NCCL INFO Could not get speed from /sys/class/net/eth0/speed. Defaulting to 10 Gbps.
pool0-1521:2460189:2460189 [0] NCCL INFO Assigned NET plugin Socket to comm
pool0-1521:2460189:2460189 [0] NCCL INFO Using network Socket
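
The fallback shown in the log above can be detected automatically when triaging runs. The following sketch scans NCCL_DEBUG=INFO output for the two message shapes that appear in this log ("NET/Plugin: Could not find: ..." and "Using network ..."). The function name `summarize_nccl_net` is hypothetical, and the exact wording of these messages may differ across NCCL versions.

```python
import re


def summarize_nccl_net(log_text):
    """Summarize network plugin selection from NCCL INFO log text
    (hypothetical helper; patterns are taken from the log in this report)."""
    missing = re.findall(r"NCCL INFO NET/Plugin: Could not find: (\S+)", log_text)
    networks = re.findall(r"NCCL INFO Using network (\S+)", log_text)
    return {
        "missing_plugins": sorted(set(missing)),
        "networks": sorted(set(networks)),
        # Flag the case from this report: a plugin was not found and Socket was used.
        "socket_fallback": bool(missing) and "Socket" in networks,
    }


if __name__ == "__main__":
    import sys
    print(summarize_nccl_net(sys.stdin.read()))
```

Piping a job log into the script prints a small dictionary; `socket_fallback` being true corresponds to the situation described in this issue.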

Metadata

Labels

bug (Something isn't working), needs-follow-up (Issue needs follow-up)
