[RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region

Jason Gunthorpe jgg at ziepe.ca
Tue Oct 6 18:38:34 UTC 2020


On Tue, Oct 06, 2020 at 08:17:05PM +0200, Daniel Vetter wrote:

> So on the gpu we pipeline this all. So step 4 doesn't happen on the
> cpu, but instead we queue up a bunch of command buffers so that the
> gpu writes these pagetables (and then flushes TLBs and then does the
> actual stuff userspace wants it to do).

mlx5 HW does basically this as well.

We just schedule this work on the device rather than on the CPU.
 
> just queue it all up and let the gpu scheduler sort out the mess. End
> result is that you get a sgt that points at stuff which very well
> might have nothing even remotely resembling your buffer in there at
> the moment. But all the copy operations are queued up, so real soon now the data
> will also be there.

The explanation makes sense, thanks.
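
To make this concrete for anyone following along, the dynamic-importer side
of that contract looks roughly like the sketch below. This is only an
illustration against the upstream dma-buf API, not the code from this
series; my_importer, my_move_notify and my_map are made-up names.

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/dma-resv.h>
#include <linux/err.h>

struct my_importer {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

/* Exporter is about to move the buffer: tear down our device mapping. */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	struct my_importer *imp = attach->importer_priv;

	/* The exporter calls us with the dma_resv lock already held. */
	if (imp->sgt) {
		dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
		imp->sgt = NULL;
	}
	/* ...also zap the device PTEs that referenced the old pages... */
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,
	.move_notify = my_move_notify,
};

static int my_map(struct my_importer *imp, struct dma_buf *dmabuf,
		  struct device *dev)
{
	imp->attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, imp);
	if (IS_ERR(imp->attach))
		return PTR_ERR(imp->attach);

	dma_resv_lock(dmabuf->resv, NULL);
	imp->sgt = dma_buf_map_attachment(imp->attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);

	return PTR_ERR_OR_ZERO(imp->sgt);
}

Since move_notify runs under the reservation lock, the callback only has to
drop the mapping and zap the device PTEs; re-mapping happens lazily on the
next use.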

> But rdma doesn't work like that, so it looks all a bit funny.

Well, I guess it could, but how would it make anything better? I can
overlap building the SGL and the device PTEs with something else doing
'move', but is that a workload that needs such aggressive optimization?

> This is also why the precise semantics of move_notify for gpu<->gpu
> sharing took forever to discuss and are still a bit wip, because you
> have the inverse problem: The dma api mapping might still be there

Seems like this all forms a graph of operations, where the next one can't
start until all of its dependencies are finished. Actually, it sounds a lot
like futures.

It would be clearer if this attach API provided some indication that the
SGL is actually a future: an SGL whose contents only become valid once the
queued work has completed.
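
Until then, an importer effectively resolves that future by hand: before
letting the device touch the pages the SGL points at, it has to wait for
(or queue behind) the exporter's pending work on the buffer's reservation
object. A minimal blocking sketch, assuming dma_resv_wait_timeout_rcu() as
it exists in current kernels (my_wait_sgl_valid is a made-up name):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/sched.h>

/*
 * Block until all pending exporter work (moves, copies) on the buffer
 * has signalled, i.e. until the "future" SGL is actually backed by data.
 */
static int my_wait_sgl_valid(struct dma_buf *dmabuf)
{
	long ret;

	ret = dma_resv_wait_timeout_rcu(dmabuf->resv,
					true,	/* wait_all: shared + exclusive */
					true,	/* interruptible */
					MAX_SCHEDULE_TIMEOUT);
	return ret < 0 ? ret : 0;
}

A device that can pipeline the way Daniel describes would instead queue its
own work behind those fences rather than blocking the CPU here.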

Jason

