[RFC PATCH v2 0/3] RDMA: add dma-buf support

Thu Jul 2 13:10:00 UTC 2020

On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > >> All you need is the ability to stop wait for ongoing accesses to end and
> > > >> make sure that new ones grab a new mapping.
> > > > Swap and flush isn't a general HW ability either..
> > > >
> > > > I'm unclear how this could be useful, it is guarenteed to corrupt
> > > > in-progress writes?
> > > >
> > > > Did you mean pause, swap and resume? That's ODP.
> > >
> > > Yes, something like this. And good to know, never heard of ODP.
> > 
> > Hm I thought ODP was full hw page faults at an individual page
> > level,
> 
> Yes
> 
> > and this stop&resume is for the entire nic. Under the hood both apply
> > back-pressure on the network if a transmission can't be received,
> > but
> 
> NIC's don't do stop and resume, blocking the Rx pipe is very
> problematic and performance destroying.
> 
> The strategy for something like ODP is more complex, and so far no NIC
> has deployed it at any granularity larger than per-page.
> 
> > So since Jason really doesn't like dma_fence much I think for rdma
> > synchronous it is. And it shouldn't really matter, since waiting for a
> > small transaction to complete at rdma wire speed isn't really that
> > long an operation. 
> 
> Even if DMA fence were to somehow be involved, how would it look?

Well above you're saying it would be performance destroying, but let's
pretend that's not a problem :-) Also, I have no clue about rdma, so this
is really just the flow we have on the gpu side.

0. rdma driver maintains a list of all dma-buf that it has mapped
somewhere and is currently using for transactions

1. rdma driver gets a dma_buf->notify_move callback on one of these
buffers. To handle that it:
	1. stops hw access somehow at the rx
	2. flushes caches and whatever else is needed
	3. moves the unmapped buffer on a special list or marks it in some
	different way as unavailable
	4. launch the kthread/work_struct to fix everything back up

2. dma-buf export (gpu driver) can now issue the commands to move the
buffer around

3. rdma driver worker gets busy to restart rx:
	1. lock all dma-buf that are currently in use (dma_resv_lock).
	thanks to ww_mutex deadlock avoidance this is possible
	2. for any buffers which have been marked as unavailable in 1.3
	grab a new mapping (which might now be in system memory, or again
	peer2peer but different address)
	3. restart hw and rx
	4. unlock all dma-buf locks (dma_resv_unlock)

There is a minor problem because step 2 only queues up the entire buffer
moves behind a pile of dma_fence, and atm we haven't made it absolutely
clear who's responsible for waiting for those to complete. For gpu drivers
it's the importer since gpu drivers don't have big qualms about
dma_fences, so 3.2. would perhaps also include a dma_fence_wait to make
sure the buffer move has actually completed.

Above flow is more or less exactly what happens for gpu workloads where we
can preempt running computations. Instead of stopping rx we preempt the
compute job and remove it from the scheduler queues, and instead of
restarting rx we just put the compute job back onto the scheduler queue as
eligible for gpu time. Otherwise it's exactly the same stuff.

Of course if you only have a single compute job and too many such
interruptions, then performance is going to tank. Don't do that, instead
make sure you have enough vram or system memory or whatever :-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch