
UCT/DEVICE: use CPU mem as AMO local buf #11345

Open
jeynmann wants to merge 8 commits into openucx:master from jeynmann:amo_cpu_buf


jeynmann commented Apr 15, 2026

What?

Use CPU memory as the AMO swap local buffer.

Why?

Using CPU memory for the AMO local buffer avoids retaining the CUDA context during iface init.

Nodes: 2
Ranks: 2 * 8
Experts: 2 * 8 * 16
Mode: rdma only
Tokens: 2,4,8,16,32,64,128,256,512

Amo local buf on cpu:

| token | D+C BW (GB/s) | Dispatch (GB/s) | Combine (GB/s) | Disp send (us) | Disp recv (us) | Comb send (us) | Comb recv (us) |
|---|---|---|---|---|---|---|---|
| 2 | 2.88 | 2.41 | 3.86 | 13.71 | 6.55 | 15.39 | 8.88 |
| 4 | 7.33 | 6.29 | 9.44 | 14.16 | 6.67 | 15.49 | 10.00 |
| 8 | 14.17 | 12.80 | 17.12 | 13.93 | 6.82 | 15.67 | 10.56 |
| 16 | 23.47 | 21.75 | 26.66 | 14.41 | 6.88 | 16.13 | 10.72 |
| 32 | 32.61 | 31.38 | 35.46 | 15.56 | 7.05 | 17.36 | 11.04 |
| 64 | 40.14 | 39.97 | 41.55 | 16.57 | 7.90 | 19.92 | 11.84 |
| 128 | 44.30 | 45.15 | 44.61 | 19.05 | 9.90 | 27.51 | 16.53 |
| 256 | 46.67 | 47.95 | 46.51 | 27.60 | 16.66 | 45.69 | 27.68 |
| 512 | 47.78 | 49.37 | 47.00 | 42.86 | 30.60 | 77.54 | 50.03 |

Amo local buf on gpu:

| token | D+C BW (GB/s) | Dispatch (GB/s) | Combine (GB/s) | Disp send (us) | Disp recv (us) | Comb send (us) | Comb recv (us) |
|---|---|---|---|---|---|---|---|
| 2 | 2.87 | 2.40 | 3.85 | 13.64 | 6.55 | 15.36 | 8.88 |
| 4 | 7.34 | 6.30 | 9.41 | 14.15 | 6.68 | 15.47 | 10.00 |
| 8 | 14.20 | 12.67 | 17.13 | 13.90 | 6.82 | 15.64 | 10.55 |
| 16 | 23.48 | 21.57 | 26.68 | 14.39 | 6.89 | 16.12 | 10.71 |
| 32 | 32.62 | 31.81 | 35.24 | 15.52 | 7.06 | 17.41 | 11.03 |
| 64 | 40.14 | 40.09 | 41.58 | 16.60 | 7.90 | 19.96 | 11.86 |
| 128 | 44.38 | 45.21 | 44.81 | 19.07 | 9.90 | 27.44 | 16.53 |
| 256 | 46.58 | 48.02 | 46.41 | 27.58 | 16.70 | 45.72 | 27.69 |
| 512 | 47.72 | 49.55 | 46.90 | 42.82 | 30.60 | 77.59 | 50.02 |
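
One way to read the two tables side by side is to compute the largest relative gap between the CPU-buffer and GPU-buffer runs for a given column. The sketch below does this for the D+C BW column only, with the values copied from the tables above. The array and function names (`dc_bw_cpu`, `dc_bw_gpu`, `max_rel_diff`) are made up for this illustration and are not part of UCX.

```c
#include <stddef.h>

/* D+C BW (GB/s) column from the tables above, for tokens 2..512. */
static const double dc_bw_cpu[] = {2.88, 7.33, 14.17, 23.47, 32.61,
                                   40.14, 44.30, 46.67, 47.78};
static const double dc_bw_gpu[] = {2.87, 7.34, 14.20, 23.48, 32.62,
                                   40.14, 44.38, 46.58, 47.72};

/* Return the largest |a[i] - b[i]| / b[i] over n samples (hypothetical helper). */
static double max_rel_diff(const double *a, const double *b, size_t n)
{
    double max = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double diff = a[i] - b[i];
        double rel;

        if (diff < 0.0) {
            diff = -diff;
        }

        rel = diff / b[i];
        if (rel > max) {
            max = rel;
        }
    }

    return max;
}
```

Calling `max_rel_diff(dc_bw_cpu, dc_bw_gpu, 9)` on this column gives a gap below 1%; the largest one is at 2 tokens (2.88 vs 2.87 GB/s).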

jeynmann changed the title from "UCT/DEVICE: use cpu buff for amo" to "UCT/DEVICE: use CPU mem as AMO local buf" on Apr 15, 2026
Comment thread src/uct/ib/mlx5/gdaki/gdaki.c Outdated
}

- status = uct_rc_gdaki_reg_mr(&md->super, self->atomic_buff,
+ status = uct_rc_gdaki_reg_mr(&md->super, &self->atomic_buff,

You can call uct_ib_reg_mr directly, without dma_buf.

}

if (self->ep_alloc_mode == UCT_RC_GDAKI_EP_ALLOC_MODE_POOL) {
status = uct_rc_gdaki_iface_init_channel_pool(self, config);

In the next PR we can probably move the channel pool init to ep creation and also retain the CUDA ctx there.

rakhmets previously approved these changes Apr 16, 2026
Comment thread src/uct/ib/mlx5/gdaki/gdaki.c Outdated
return dmabuf_supported;
}

#if HAVE_DECL_MLX5DV_UMEM_MASK_DMABUF

Maybe move this function completely inside uct_rc_gdaki_umem_reg?

Comment thread src/uct/ib/mlx5/gdaki/gdaki.h Outdated
struct ibv_mr *atomic_mr;
CUdeviceptr atomic_raw;
- uint64_t *atomic_buff;
+ uint64_t atomic_buff;

I'd consider allocating a separate atomic buffer with posix_memalign (size=8, align=UCS_SYS_CACHE_LINE_SIZE).
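
A minimal sketch of what that suggestion could look like, assuming a 64-byte cache line. `EXAMPLE_CACHE_LINE_SIZE` stands in for `UCS_SYS_CACHE_LINE_SIZE`, and the helper names `example_atomic_buf_alloc` and `example_atomic_buf_free` are hypothetical; this is not the code in the PR.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* needed for posix_memalign in strict modes */
#endif

#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64 bytes stands in for UCS_SYS_CACHE_LINE_SIZE. */
#define EXAMPLE_CACHE_LINE_SIZE 64

/* Allocate one zeroed 8-byte slot that starts on a cache-line boundary.
 * Returns NULL if the allocation fails. */
static uint64_t *example_atomic_buf_alloc(void)
{
    void *ptr = NULL;

    if (posix_memalign(&ptr, EXAMPLE_CACHE_LINE_SIZE, sizeof(uint64_t)) != 0) {
        return NULL;
    }

    *(uint64_t*)ptr = 0;
    return (uint64_t*)ptr;
}

/* Memory from posix_memalign is released with plain free(). */
static void example_atomic_buf_free(uint64_t *buf)
{
    free(buf);
}
```

The start of the buffer is cache-line aligned, so the 8-byte slot does not straddle two lines. Since only 8 bytes are requested, the allocator may still place other objects in the rest of that line.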


yosefe commented Apr 16, 2026

  1. The failure seems relevant.
  2. Can you please add error prints to gdaki.c so that it is clear why the "UCX ERROR failed to get device_ep for lane=2" error happens?
