Supporting Linux kernel TCP zero-copy functionality #11260

@tohojo

Description

Hi

I am trying to figure out whether it's feasible to add support for TCP zero-copy send and receive into UCT, with the ultimate goal of supporting zero-copy directly to and from GPU memory using the "device memory" support in the Linux kernel.

I am struggling a bit to wrap my head around how the different layers interact, and thus what exactly would be needed to add this support, so I'm looking for guidance and/or help with implementing this. I wrote up an overview of how the Linux kernel zero-copy support works, but I'll try to summarise what is needed to use zero-copy in different scenarios here (simplifying a little):

  • TX zero-copy from user memory: Just supply the MSG_ZEROCOPY flag to sendmsg(), and make sure to keep the buffer around until the kernel signals completion
  • TX zero-copy from device (GPU) memory: Bind a memory region to a network device transmit queue using a dmabuf file descriptor, then use sendmsg(MSG_ZEROCOPY) to transmit from offsets into that buffer.
  • RX zero-copy: Bind a memory region (either userspace memory, or a dmabuf file descriptor) to a NIC receive queue, and enable TCP header split on the NIC. The kernel will then allocate the memory pages passed to the NIC from the bound memory region, for all data received on that queue. Userspace gets notifications of incoming data fragments using recvmsg().

Given these constraints, my current understanding of how this would fit into UCX is as follows:

  • ZC TX from userspace should be fairly straightforward; UCT already uses sendmsg() in its zcopy operations, so it's more or less just a matter of adding the MSG_ZEROCOPY flag there.
  • ZC TX from device memory requires pre-registration of the memory region; AFAICT, there are existing APIs to enable this, but I have not been able to wrap my head around what exactly is needed to enable these for the TCP transport.
  • For ZC RX, AFAICT a separate buffer needs to be registered for each transfer operation, and the transfer needs to happen over a separate TCP connection that can be steered to the right hardware queue. This sounds a little bit like what the "rendezvous" mechanism is for, but I'm also struggling to figure out exactly how that is activated.

Could someone please help me with some pointers for whether my understanding is correct, and how to make progress on adding this functionality? And comment on whether this is something that you would be interested in having support upstream as well, of course! :)

Many thanks in advance!
