Supporting Linux kernel TCP zero-copy functionality #11260

@tohojo

Description

Hi

I am trying to figure out whether it's feasible to add support for TCP zero-copy send and receive into UCT, with the ultimate goal of supporting zero-copy directly to and from GPU memory using the "device memory" support in the Linux kernel.

I am struggling a bit to wrap my head around how the different layers interact, and thus what exactly would be needed to add this support, so I'm looking for guidance and/or help with implementing this. I wrote up an overview of how the Linux kernel zero-copy support works, but I'll try to summarise what is needed to use zero-copy in different scenarios here (simplifying a little):

  • TX zero-copy from user memory: Just supply the MSG_ZEROCOPY flag to sendmsg(), and make sure to keep the buffer around until the kernel signals completion
  • TX zero-copy from device (GPU) memory: Bind a memory region to a network device transmit queue using a dmabuf file descriptor, then use sendmsg(MSG_ZEROCOPY) to transmit from offsets into that buffer.
  • RX zero-copy: Bind a memory region (either userspace memory, or a dmabuf file descriptor) to a NIC receive queue, and enable TCP header split on the NIC. The kernel will then allocate the memory pages passed to the NIC from the bound memory region, for all data received on that queue. Userspace gets notifications of incoming data fragments using recvmsg().

Given these constraints, my current understanding of how this would fit into UCX is as follows:

  • ZC TX from userspace should be fairly straightforward; UCT already uses sendmsg() in its zcopy operations, so it's more or less just a matter of adding the MSG_ZEROCOPY flag there.
  • ZC TX from device memory requires pre-registration of the memory region; AFAICT, there are existing APIs to enable this, but I have not been able to wrap my head around what exactly is needed to enable these for the TCP transport.
  • For ZC RX, AFAICT a separate buffer needs to be registered for each transfer operation, and the transfer needs to happen over a separate TCP connection that can be steered to the right hardware queue. This sounds a little bit like what the "rendezvous" mechanism is for, but I'm also struggling to figure out exactly how that is activated.

Could someone please help me with some pointers for whether my understanding is correct, and how to make progress on adding this functionality? And comment on whether this is something that you would be interested in having support upstream as well, of course! :)

Many thanks in advance!
