[RFC PATCH v2 0/3] RDMA: add dma-buf support

Thu Jul 2 18:15:40 UTC 2020

On Thu, Jul 02, 2020 at 04:50:32PM +0200, Christian König wrote:
> Am 02.07.20 um 15:29 schrieb Jason Gunthorpe:
> > On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
> > > On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > > > > > > All you need is the ability to stop wait for ongoing accesses to end and
> > > > > > > > make sure that new ones grab a new mapping.
> > > > > > > Swap and flush isn't a general HW ability either..
> > > > > > > 
> > > > > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > > > > in-progress writes?
> > > > > > > 
> > > > > > > Did you mean pause, swap and resume? That's ODP.
> > > > > > Yes, something like this. And good to know, never heard of ODP.
> > > > > Hm I thought ODP was full hw page faults at an individual page
> > > > > level,
> > > > Yes
> > > > 
> > > > > and this stop&resume is for the entire nic. Under the hood both apply
> > > > > back-pressure on the network if a transmission can't be received,
> > > > > but
> > > > NIC's don't do stop and resume, blocking the Rx pipe is very
> > > > problematic and performance destroying.
> > > > 
> > > > The strategy for something like ODP is more complex, and so far no NIC
> > > > has deployed it at any granularity larger than per-page.
> > > > 
> > > > > So since Jason really doesn't like dma_fence much I think for rdma
> > > > > synchronous it is. And it shouldn't really matter, since waiting for a
> > > > > small transaction to complete at rdma wire speed isn't really that
> > > > > long an operation.
> > > > Even if DMA fence were to somehow be involved, how would it look?
> > > Well above you're saying it would be performance destroying, but let's
> > > pretend that's not a problem :-) Also, I have no clue about rdma, so this
> > > is really just the flow we have on the gpu side.
> > I see, no, this is not workable, the command flow in RDMA is not at
> > all like GPU - what you are a proposing is a global 'stop the whole
> > chip' Tx and Rx flows for an undetermined time. Not feasible

Yeah, I said I have no clue about rdma :-)

> > What we can do is use ODP techniques and pause only the MR attached to
> > the DMA buf with the process you outline below. This is not so hard to
> > implement.
> 
> Well it boils down to only two requirements:
> 
> 1. You can stop accessing the memory or addresses exported by the DMA-buf.
> 
> 2. Before the next access you need to acquire a new mapping.
> 
> How you do this is perfectly up to you. E.g. you can stop everything, just
> prevent access to this DMA-buf, or just pause the users of this DMA-buf....

Yeah in a gpu we also don't stop the entire world, only the context that
needs the buffer. If there's other stuff to run, we do keep running that.
Usually the reason for a buffer move is that we do actually have other
stuff that needs to be run, and which needs more vram for itself, so we
might have to throw out a few buffers from vram that can also be placed in
system memory.

Note that a modern gpu has multiple engines and most have or are gaining
hw scheduling of some sort, so there's a pile of concurrent gpu context
execution going on at any moment. The days where we just push gpu work
into a fifo are (mostly) long gone.

> > > 3. rdma driver worker gets busy to restart rx:
> > > 	1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > 	thanks to ww_mutex deadlock avoidance this is possible
> > Why all? Why not just lock the one that was invalidated to restore the
> > mappings? That is some artifact of the GPU approach?
> 
> No, but you must make sure that mapping one doesn't invalidate others you
> need.
> 
> Otherwise you can end up in a nice live lock :)

Also if you don't have pagefaults, but have to track busy memory at a
context level, you do need to grab all locks of all buffers you need, or
you'd race. There's nothing stopping a concurrent ->notify_move on some
other buffer you'll need otherwise, and if you try to be clever and roll
you're own locking, you'll anger lockdep - you're own lock will have to be
on both sides of ww_mutex or it wont work, and that deadlocks.

So ww_mutex multi-lock dance or bust.

> > And why is this done with work queues and locking instead of a
> > callback saying the buffer is valid again?
> 
> You can do this as well, but a work queue is usually easier to handle than a
> notification in an interrupt context of a foreign driver.

Yeah you can just install a dma_fence callback but
- that's hardirq context
- if you don't have per-page hw faults you need the multi-lock ww_mutex
  dance anyway to avoid races.

On top of that the exporter has no idea whether the importer still really
needs the mapping, or whether it was just kept around opportunistically.
pte setup isn't free, so by default gpus keep everything mapped. At least
for the classic gl/vk execution model, compute like rocm/amdkfd is a bit
different here.

Cheers, Daniel

> 
> Regards,
> Christian.
> 
> > 
> > Jason
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch