Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

Jerome Glisse j.glisse at gmail.com
Fri Nov 30 13:07:54 PST 2012


On Fri, Nov 30, 2012 at 09:35:49PM +0100, Thomas Hellstrom wrote:
> On 11/30/2012 08:25 PM, Jerome Glisse wrote:
> >On Fri, Nov 30, 2012 at 07:31:04PM +0100, Thomas Hellstrom wrote:
> >>On 11/30/2012 07:07 PM, Jerome Glisse wrote:
> >>>On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
> >>>>On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> >>>>>On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas at shipmail.org> wrote:
> >>>>>>On 11/30/2012 05:30 PM, Jerome Glisse wrote:
> >>>>>>>On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org>
> >>>>>>>wrote:
> >>>>>>>>On 11/29/2012 10:58 PM, Marek Olšák wrote:
> >>>>>>>>>What I tried to point out was that the synchronization shouldn't be
> >>>>>>>>>needed, because the CPU shouldn't do anything with the contents of
> >>>>>>>>>evicted buffers. The GPU moves the buffers, not the CPU. What does the
> >>>>>>>>>CPU do besides updating some kernel structures?
> >>>>>>>>>
> >>>>>>>>>Also, buffer deletion is something where you don't need to wait for
> >>>>>>>>>the buffer to become idle if you know the memory area won't be
> >>>>>>>>>mapped by the CPU, ever. The memory can be reclaimed right away. It
> >>>>>>>>>would be the GPU that moves new data in, and once that happens, the old
> >>>>>>>>>buffer will be trivially idle, because single-ring GPUs execute
> >>>>>>>>>commands in order.
> >>>>>>>>>
> >>>>>>>>>Marek
> >>>>>>>>Actually asynchronous eviction / deletion is something I have been
> >>>>>>>>prototyping for a while but never gotten around to implement in TTM:
> >>>>>>>>
> >>>>>>>>There are a few minor caveats:
> >>>>>>>>
> >>>>>>>>With buffer deletion, what you say is true for fixed memory, but not for
> >>>>>>>>TT memory, where pages are reclaimed by the system after buffer destruction.
> >>>>>>>>That means that we don't have to wait for idle to free GPU space, but we
> >>>>>>>>need to wait before pages are handed back to the system.
> >>>>>>>>
> >>>>>>>>Swapout needs to access the contents of evicted buffers, but synchronizing
> >>>>>>>>doesn't need to happen until just before swapout.
> >>>>>>>>
> >>>>>>>>Multi-ring - CPU support: If another ring / engine or the CPU is about to
> >>>>>>>>move in buffer contents to VRAM or a GPU aperture that was previously
> >>>>>>>>evicted by another ring, it needs to sync with that eviction, but doesn't
> >>>>>>>>know what buffer or even which buffers occupied the space previously.
> >>>>>>>>Trivially one can attach a sync object to the memory type manager that
> >>>>>>>>represents the last eviction from that memory type, and *any* engine
> >>>>>>>>(CPU or GPU) that moves buffer contents in needs to order that movement
> >>>>>>>>with respect to that fence. As you say, with a single ring and no CPU
> >>>>>>>>fallbacks, that ordering is a no-op, but any common (non-driver based)
> >>>>>>>>implementation needs to support this.
> >>>>>>>>
> >>>>>>>>A single fence attached to the memory type manager is the simplest solution,
> >>>>>>>>but a solution with a fence for each free region in the free list is also
> >>>>>>>>possible. Then TTM needs a driver callback to be able to order fences
> >>>>>>>>w.r.t. each other.
> >>>>>>>>
> >>>>>>>>/Thomas
> >>>>>>>>
> >>>>>>>Radeon already handles multi-ring and ttm interaction with what we call
> >>>>>>>semaphores. Semaphores are created to synchronize with fences across
> >>>>>>>different rings. I think the easiest solution is to just remove the bo
> >>>>>>>wait in ttm and let the driver handle this.
> >>>>>>The wait can be removed, but only conditioned on a driver flag that says it
> >>>>>>supports unsynchronized buffer moves.
> >>>>>>
> >>>>>>The multi-ring case I'm talking about is:
> >>>>>>
> >>>>>>Ring 1 evicts buffer A, emits fence 0
> >>>>>>Ring 2 evicts buffer B, emits fence 1
> >>>>>>...Other evictions take place on various rings, perhaps including ring 1 and
> >>>>>>ring 2.
> >>>>>>Ring 3 moves buffer C into the space which happens to be the union of the
> >>>>>>space previously occupied by buffer A and buffer B.
> >>>>>>
> >>>>>>Question is: which fence do you want to order this move with?
> >>>>>>The answer is whichever of fence 0 and 1 signals last.
> >>>>>>
> >>>>>>I think it's a reasonable thing for TTM to keep track of this, but in order
> >>>>>>to do so it needs a driver callback that
> >>>>>>can order two fences, and can order a job in the current ring w.r.t. a fence.
> >>>>>>In radeon's case that driver callback
> >>>>>>would probably insert a barrier / semaphore. In the case of simpler hardware
> >>>>>>it would wait on one of the fences.
> >>>>>>
> >>>>>>/Thomas
> >>>>>>
> >>>>>I don't think we can order fences easily with a clean api; I would
> >>>>>rather see ttm provide a list of fences to the driver and tell the
> >>>>>driver that before moving this object all the fences on this list need
> >>>>>to be completed. I think it's as easy as associating fences with drm_mm
> >>>>>(well, nouveau has its own mm stuff), but the idea would basically be
> >>>>>that fences are associated both with the bo and with the mm object, so
> >>>>>you know when a segment of memory is idle/available for use.
> >>>>>
> >>>>>Cheers,
> >>>>>Jerome
> >>>>Hmm. Agreed that would save a lot of barriers.
> >>>>
> >>>>Even if TTM tracks fences by free mm regions or a single fence for
> >>>>the whole memory type, it's a simple fact that fences from the same
> >>>>ring are trivially ordered, which means such a list should contain at
> >>>>most as many fences as there are rings.
> >>>Yes, one function callback is needed to know which fences are necessary;
> >>>also, ttm needs to know the number of rings (note that I think newer
> >>>hw will have something like 1024 rings or even more; even today's hw
> >>>might have as many, as I think an nvidia channel is pretty much what I
> >>>define to be a ring).
> >>>
> >>>But I think the common case will be a few fences across a few rings: e.g.
> >>>one ring is the dma ring, then you have a ring for one of the GL
> >>>contexts that is using the memory and another ring for the new context
> >>>that wants to use the memory.
> >>>
> >>>>So, whatever approach is chosen, TTM needs to be able to determine
> >>>>that trivial ordering, and I think the upcoming cross-device fencing
> >>>>work will face the exact same problem.
> >>>>
> >>>>My proposed ordering API would look something like
> >>>>
> >>>>struct fence *order_fences(struct fence *fence_a, struct fence
> >>>>*fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
> >>>>
> >>>>Returns whichever of the fences @fence_a and @fence_b that, when
> >>>>signaled, guarantees that the other
> >>>>fence has also signaled. If @trivial_order is true and the driver cannot
> >>>>trivially order the fences, it may return ERR_PTR(-EAGAIN);
> >>>>if @interruptible is true, any wait should be performed
> >>>>interruptibly; and if @no_wait_gpu is true, the function is not
> >>>>allowed to wait on the gpu but should return ERR_PTR(-EBUSY) if it needs
> >>>>to do so to order the fences.
> >>>>
> >>>>(Hardware without semaphores can't order fences without waiting on them).
> >>>>
> >>>>The list approach you suggest would use @trivial_order = true; the
> >>>>single-fence approach would use @trivial_order = false.
> >>>>
> >>>>And a first simple implementation in TTM would perhaps use your list
> >>>>approach with a single list for the whole memory type.
> >>>>
> >>>>/Thomas
> >>>I would rather add a callback like:
> >>>
> >>>ttm_reduce_fences(unsigned *nfences, fence **fencearray)
> >>I don't agree here. I think the fence order function is more
> >>versatile and a good abstraction that can be
> >>applied in a number of cases to this problem. Anyway we should sync
> >>this with Maarten and his
> >>fence work. The same problem applies to attaching shared fences to a bo.
> >Radeon already handles the bo case on multi-ring and I think it should
> >be left to the driver to do what's necessary there.
> >
> >I don't think more versatility is bad here as such; in fact I am pretty
> >sure that no hw can give fence ordering, and I also think that if the
> >driver has to track multi-ring fence ordering it will just waste
> >resources for no good reason. To track that in radeon I will need to keep
> >a list of semaphores and know which semaphores ensure synchronization
> >between which rings, and for each, which fences are concerned; just
> >thinking about it, it would be a messy graph with tons of nodes.
> >
> >Of course here I am thinking in terms of newer GPUs with tons of rings.
> >GPUs with one ring, which are a vanishing category as newer opencl
> >requires more rings, are easy to handle, but I really don't think we
> >should design for those.
> The biggest problem I see with ttm_reduce_fences() is that it seems
> to have high complexity, since it
> doesn't know which fence is the new one. And if it did, it would
> only be a multi-fence version of
> order_fences(trivial=true).

I am sure all drivers with multi-ring will store the ring id in their
fence structure; that's from my pov a requirement. So when you get a
list of fences you first go over the fences on the same ring, and so
far all fence implementations use increasing sequence numbers for
fences. So reducing becomes as easy as only keeping the most recent
fence for each of the rings with an active fence. At the same time it
could check (assuming it's a quick operation) whether a fence has
already signaled or not.

For radeon this function would be very small, a simple nested loop
with a couple of tests in the loop.
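
To make that concrete, here is a rough sketch of such a reduction
(purely illustrative, with made-up type and helper names rather than
actual TTM or radeon code), assuming every fence carries its ring id
and a per-ring increasing sequence number:

#include <linux/types.h>

struct my_fence {
	unsigned int ring;	/* ring this fence was emitted on */
	u64 seqno;		/* increasing per-ring sequence number */
	/* ... driver specific ... */
};

/* Hypothetical cheap "has this fence already signaled?" check. */
bool my_fence_signaled(struct my_fence *fence);

/* Keep only the newest unsignaled fence per ring, compacting in place. */
static void my_reduce_fences(unsigned int *nfences, struct my_fence **fences)
{
	unsigned int i, j, n = 0;

	for (i = 0; i < *nfences; i++) {
		struct my_fence *f = fences[i];

		if (my_fence_signaled(f))
			continue;		/* already idle, drop it */

		for (j = 0; j < n; j++) {
			if (fences[j]->ring != f->ring)
				continue;
			if (f->seqno > fences[j]->seqno)
				fences[j] = f;	/* newer fence on the same ring wins */
			break;
		}
		if (j == n)
			fences[n++] = f;	/* first active fence seen on this ring */
	}
	*nfences = n;
}

The reduced list then holds at most one fence per ring that still has
unsignaled work touching the memory.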
 
> >I think here we will have to agree to disagree, but I see a fence array
> >as a perfect solution that is flexible enough and uses the least resources
> >both in terms of cpu usage and memory (assuming a proper implementation).
> I thought it would always require nrings*8 bytes on each memory block?

As I said, I think there won't be much more than 3 or 4 rings on average
for each block of memory. So the idea is probably to allocate room for
8 rings and resize in the rare case where we need more.

It sounds very unlikely to have more than 3 or 4 rings with active fences
at any point in time. You would have a dma ring, a 3d ring for one client,
maybe a 3d ring for another client, and a video decoder ring. Also, the
driver can do optimizations like virtualizing fences for each partition
of the gpu; this would drastically reduce the number of fences, e.g. 1
for dma, 1 for each partition of the gpu, 1 for the video decoder, 1 for
display. Given that there can't be that many partitions (I think on
radeon you can only partition per CU, which is something like 4 or 8).

In other words, when you have n rings that do work on m partitions of
the gpu, you only need m fence categories, as you know work on each
partition will complete in order.
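
As a rough sketch of that idea (again with hypothetical names, reusing
the my_fence struct from the sketch above, and assuming the driver knows
which partition each ring feeds and that a partition retires its work in
submission order):

#define MY_MAX_PARTITIONS 8

/* Hypothetical ring -> partition mapping provided by the driver. */
unsigned int my_ring_to_partition(unsigned int ring);

/* Most recently emitted fence per partition; because a partition retires
 * its work in submission order, this single fence covers every older
 * fence emitted for that partition. */
static struct my_fence *my_last_fence[MY_MAX_PARTITIONS];

static void my_track_fence(struct my_fence *fence)
{
	my_last_fence[my_ring_to_partition(fence->ring)] = fence;
}

With that, the fence list attached to a block of memory caps out at m
entries instead of one per ring.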

> >
> >I think there should be some granularity here, something like: don't
> >merge blocks over 16M, or don't merge blocks if they would have more
> >than N fences when merged together. Of course it means a special
> >allocator, but as my work on radeon memory placement shows, I think that
> >the whole vram business (which is really where multi-fence will be used
> >the most) needs a worker thread that schedules work every now and then;
> >this worker thread could also merge mm blocks when fences complete ...
> >
> >But I really don't think there should be a list for the whole manager.
> >Think about the case of a very long-running cl job that is still running
> >but already has its eviction scheduled and its mm vram node marked as
> >free with a fence. If we have a list per manager then vram becomes
> >unavailable for as long as this cl job runs. I am thinking here of newer
> >gpus that can partition their compute units (the CL 1.2 spec mandates
> >that iirc), for instance between cl and gl, so the gpu can still run
> >other jobs while this cl thing is going on.
> 
> Yeah, I'm not saying this is an efficient solution. I'm saying it's
> the simplest solution. We can make
> a data structure arbitrarily big out of this, or perhaps provide a
> selection of algorithms to
> choose from.
> 
> /Thomas

Customizing drm_mm to handle allocating from several adjacent blocks
doesn't sound too complex; we could introduce a virtual mm block that
groups all adjacent blocks and keeps an array of fences for each of
the subblocks with an active fence. But yes, this could be done as a
second step.
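
Roughly something like this, as a data-structure sketch only
(hypothetical names and layout, not an actual drm_mm change; types as
in the sketches above):

/* One entry per underlying free range that still has a fence pending. */
struct my_free_subblock {
	u64 start;
	u64 size;
	struct my_fence *fence;		/* NULL once the range is idle */
};

/* A "virtual" block covering the union of adjacent free subblocks. */
struct my_virtual_block {
	u64 start;
	u64 size;			/* union of the subblocks */
	unsigned int nsubblocks;
	struct my_free_subblock subblocks[];
};

An allocation carved out of such a virtual block would then only need
to be ordered against the fences of the subblocks it actually overlaps.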

Cheers,
Jerome

