[RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region

Daniel Vetter daniel at ffwll.ch
Tue Oct 6 18:17:05 UTC 2020


On Tue, Oct 6, 2020 at 8:02 PM Jason Gunthorpe <jgg at ziepe.ca> wrote:
>
> On Tue, Oct 06, 2020 at 07:24:30PM +0200, Daniel Vetter wrote:
> > On Tue, Oct 6, 2020 at 6:34 PM Daniel Vetter <daniel at ffwll.ch> wrote:
> > >
> > > On Tue, Oct 06, 2020 at 12:49:56PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Oct 06, 2020 at 11:22:14AM +0200, Daniel Vetter wrote:
> > > > >
> > > > > For reinstating the pages you need:
> > > > >
> > > > > - dma_resv_lock, this prevents anyone else from issuing new moves or
> > > > >   anything like that
> > > > > - dma_resv_get_excl + dma_fence_wait to wait for any pending moves to
> > > > >   finish. gpus generally don't wait on the cpu, but block the dependent
> > > > >   dma operations from being scheduled until that fence fired. But for rdma
> > > > >   odp I think you need the cpu wait in your worker here.
> > > >
> > > > Reinstating is not really any different from the first insertion, so
> > > > then all this should be needed in every case?
> > >
> > > Yes. Without move_notify we pin the dma-buf into system memory, so it
> > > can't move, and hence you also don't have to chase it. But with
> > > move_notify this all becomes possible.
> >
> > I just realized I got it wrong compared to gpus. It needs to be:
> > 1. dma_resv_lock
> > 2. dma_buf_map_attachment, which might have to move the buffer around
> > again if you're unlucky
> > 3. wait for the exclusive fence
> > 4. put sgt into your rdma ptes
> > 5. dma_resv_unlock
> >
> > Maybe also something we should document somewhere for dynamic buffers.
> > Assuming I got it right this time around ... Christian?
>
> #3 between 2 and 4 seems strange - I would expect once
> dma_buf_map_attachment() returns that the buffer can be placed in the
> ptes. It certainly can't be changed after the SGL is returned.


So on the gpu we pipeline all of this. Step 4 doesn't happen on the
cpu; instead we queue up a bunch of command buffers so that the gpu
writes these pagetables (and then flushes tlbs and does the actual
stuff userspace wants it to do).

And all of that is blocked in the gpu scheduler on the fences we
acquire in step 3. Again we don't wait on the cpu for that fence, we
just queue it all up and let the gpu scheduler sort out the mess. End
result is that you get an sgt that points at memory which very well
might contain nothing even remotely resembling your buffer at the
moment. But all the copy operations are queued up, so real soon now
the data will also be there.
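
To make that concrete for an rdma/odp style importer that does do the
cpu wait, a rough sketch of steps 1-5 against the current
dma-buf/dma_resv interfaces could look like the below. my_prog_ptes()
is just a made-up placeholder for the importer's own pte update, and a
gpu driver would hand the exclusive fence to its scheduler instead of
calling dma_fence_wait():

#include <linux/dma-buf.h>
#include <linux/dma-fence.h>
#include <linux/dma-mapping.h>
#include <linux/dma-resv.h>
#include <linux/err.h>

/* made-up placeholder for the importer's own pte programming */
static int my_prog_ptes(void *priv, struct sg_table *sgt);

static int my_revalidate(struct dma_buf_attachment *attach)
{
        struct dma_resv *resv = attach->dmabuf->resv;
        struct dma_fence *fence;
        struct sg_table *sgt;
        int ret;

        /* 1. lock, so no new moves can be issued underneath us */
        ret = dma_resv_lock(resv, NULL);
        if (ret)
                return ret;

        /* 2. map, which may move the buffer around again */
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt)) {
                ret = PTR_ERR(sgt);
                goto out_unlock;
        }

        /*
         * 3. wait for the exclusive fence, i.e. for any pending move
         * to actually finish. a gpu driver would queue this up as a
         * scheduler dependency instead of blocking here.
         */
        fence = dma_resv_get_excl(resv);
        if (fence) {
                ret = dma_fence_wait(fence, false);
                if (ret)
                        goto out_unmap;
        }

        /* 4. now the sgt really describes the buffer, fill the ptes */
        ret = my_prog_ptes(attach->importer_priv, sgt);
        if (ret)
                goto out_unmap;

        /* 5. unlock, from now on move_notify tells us about moves */
        dma_resv_unlock(resv);
        return 0;

out_unmap:
        dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
out_unlock:
        dma_resv_unlock(resv);
        return ret;
}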

This is also why the precise semantics of move_notify for gpu<->gpu
sharing took forever to discuss and are still a bit wip, because you
have the inverse problem: The dma api mapping might still be there in
the iommu, but the data behind it is long gone and replaced. So we
need to be really careful about making sure that dma operations are
blocked in the gpu properly, with all the flushing and everything. I
think we've reached the conclusion that ->move_notify is allowed to
change the set of fences in the dma_resv so that these flushes and pte
writes can be queued up correctly (on many gpus you can't synchronously
flush tlbs, yay). The exporter then needs to make sure that the actual
buffer move is queued up behind all these operations too.
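
Purely as an illustration of the importer side of that contract (not
taken from the patch under discussion, just a sketch), a move_notify
could look roughly like this. struct my_mem, my_invalidate_ptes() and
the revalidate worker are made-up stand-ins for the importer's own
bookkeeping, and the worker would end up doing the 1-5 dance sketched
above:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/dma-resv.h>
#include <linux/workqueue.h>

struct my_mem {
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
        struct work_struct revalidate_work;     /* calls my_revalidate() */
};

/* device-specific: stop the hw from using the stale mapping */
static void my_invalidate_ptes(struct my_mem *mem);

static void my_move_notify(struct dma_buf_attachment *attach)
{
        struct my_mem *mem = attach->importer_priv;

        /* the exporter calls us with the reservation lock held */
        dma_resv_assert_held(attach->dmabuf->resv);

        my_invalidate_ptes(mem);

        if (mem->sgt) {
                dma_buf_unmap_attachment(attach, mem->sgt, DMA_BIDIRECTIONAL);
                mem->sgt = NULL;
        }

        /* remap lazily, outside of the exporter's move */
        schedule_work(&mem->revalidate_work);
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,
        .move_notify = my_move_notify,
};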

But rdma doesn't work like that, so it looks all a bit funny.

Anticipating your next question: Can this mean there's a bunch of
different dma/buffer mappings in flight for the same buffer?

Yes. We call these ghost objects, at least in the ttm helpers.

> Feels like #2 should serialize all this internally? An API that
> returns invalidate data sometimes is dangerous :)

If you use the non-dynamic mode, where we pin the buffer into system
memory at dma_buf_attach time, it kinda works like that. Also it's not
flat-out invalid data, it's the most up-to-date data reflecting all
committed changes. Plus dma_resv tells you when that will actually be
reality in the hardware, not just the software tracking of what's
going on.
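
For reference, that split is what the importer picks at attach time;
again just a sketch reusing the made-up names from above:

static struct dma_buf_attachment *
my_attach(struct dma_buf *dmabuf, struct device *dev,
          struct my_mem *mem, bool dynamic)
{
        if (!dynamic)
                /* buffer gets pinned into system memory, never moves */
                return dma_buf_attach(dmabuf, dev);

        /* opt into moves: must cope with ->move_notify */
        return dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, mem);
}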
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

