Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

Fri Nov 30 09:08:15 PST 2012

On 11/30/2012 05:30 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>
>>> What I tried to point out was that the synchronization shouldn't be
>>> needed, because the CPU shouldn't do anything with the contents of
>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>> CPU do besides updating some kernel structures?
>>>
>>> Also, buffer deletion is something where you don't need to wait for
>>> the buffer to become idle if you know the memory area won't be
>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>> would be the GPU to move new data in and once that happens, the old
>>> buffer will be trivially idle, because single-ring GPUs execute
>>> commands in order.
>>>
>>> Marek
>>
>> Actually, asynchronous eviction / deletion is something I have been
>> prototyping for a while but never gotten around to implementing in TTM.
>>
>> There are a few minor caveats:
>>
>> With buffer deletion, what you say is true for fixed memory, but not for TT
>> memory where pages are reclaimed by the system after buffer destruction.
>> That means that we don't have to wait for idle to free GPU space, but we
>> need to wait before pages are handed back to the system.
>>
>> Swapout needs to access the contents of evicted buffers, but synchronizing
>> doesn't need to happen until just before swapout.
>>
>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>> move in buffer contents to VRAM or a GPU aperture that was previously
>> evicted by another ring, it needs to sync with that eviction, but doesn't
>> know what buffer or even which buffers occupied the space previously.
>> Trivially one can attach a sync object to the memory type manager that
>> represents the last eviction from that memory type, and *any* engine (CPU or
>> GPU) that moves buffer contents in needs to order that movement with respect
>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>> ordering is a no-op, but any common (non-driver based) implementation needs
>> to support this.
>>
>> A single fence attached to the memory type manager is the simplest solution,
>> but a solution with a fence for each free region in the free list is also
>> possible. Then TTM needs a driver callback to be able to order fences
>> w.r.t. each other.
>>
>> /Thomas
>>
> Radeon already handles multi-ring and TTM interaction with what we call
> semaphores. Semaphores are created to synchronize with fences across
> different rings. I think the easiest solution is to just remove the bo
> wait in TTM and let the driver handle this.

The wait can be removed, but only conditioned on a driver flag that says
the driver supports asynchronous (unsynchronized) buffer moves.
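
As a rough illustration of such a flag, here is a minimal sketch. Every name
in it (sketch_driver, async_moves, sketch_evict_sync) is hypothetical and not
part of the real TTM interface, and the fence wait is simulated by setting a
boolean:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical types for illustration; not the real TTM structures. */
struct sketch_fence {
    bool signaled;
};

struct sketch_driver {
    bool async_moves;   /* driver promises to order buffer moves itself */
};

/* Stand-in for a blocking CPU wait; here it only marks the fence done. */
static void sketch_fence_wait(struct sketch_fence *f)
{
    f->signaled = true;
}

/*
 * Called before the space of an evicted buffer is reused.
 * Returns true if the CPU had to block on the eviction fence.
 */
static bool sketch_evict_sync(const struct sketch_driver *drv,
                              struct sketch_fence *f)
{
    assert(drv && f);
    if (drv->async_moves || f->signaled)
        return false;   /* driver orders the move, or nothing to wait for */
    sketch_fence_wait(f);
    return true;
}
```

A driver that leaves async_moves unset keeps today's blocking behaviour, so
existing drivers would not have to change.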

The multi-ring case I'm talking about is:

Ring 1 evicts buffer A, emits fence 0
Ring 2 evicts buffer B, emits fence 1
...Other evictions take place on various rings, perhaps including ring 1
and ring 2.
Ring 3 moves buffer C into a space which happens to be the union of
the space previously occupied by buffer A and buffer B.

Question is: which fence do you want to order this move with?
The answer is whichever of fence 0 and 1 signals last.
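
To make the bookkeeping concrete, here is a minimal sketch of the
fence-per-free-region variant. It assumes each free region remembers the
fence of the eviction that freed it; the type and function names are
hypothetical and fences are reduced to integer ids:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free-list entry, for illustration only. */
struct sketch_region {
    unsigned long start;   /* first page of the free region */
    unsigned long size;    /* length in pages */
    int fence_id;          /* fence of the eviction that freed it, -1 if none */
};

/*
 * Find the fences of every free region overlapping [start, start + size).
 * Writes at most out_cap ids to out and returns the number of matching
 * fences, so the caller can detect a too-small output array.
 */
static size_t sketch_collect_fences(const struct sketch_region *free_list,
                                    size_t n_regions,
                                    unsigned long start, unsigned long size,
                                    int *out, size_t out_cap)
{
    size_t n = 0;

    assert(free_list && out);
    for (size_t i = 0; i < n_regions; i++) {
        const struct sketch_region *r = &free_list[i];
        int overlaps = r->start < start + size &&
                       start < r->start + r->size;

        if (overlaps && r->fence_id >= 0) {
            if (n < out_cap)
                out[n] = r->fence_id;
            n++;
        }
    }
    return n;
}
```

With A freed under fence 0 and B under fence 1, a buffer C placed across both
regions gets back exactly those two fences. Which of them signals last is not
known up front when they come from different rings, so the move has to be
ordered after both.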

I think it's a reasonable thing for TTM to keep track of this, but in
order to do so it needs a driver callback that can order two fences,
and can order a job in the current ring w.r.t. a fence. In radeon's
case that driver callback would probably insert a barrier / semaphore.
In the case of simpler hardware it would wait on one of the fences.
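
A minimal sketch of what such a callback pair could look like, with the
"simpler hardware" variant filled in. All names are hypothetical, fences are
simulated with a boolean, and the semaphore-based variant is left out because
it depends on the hardware:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fence; a real one would carry a sequence number etc. */
struct ordered_fence {
    int ring;
    bool signaled;
};

struct fence_order_ops {
    /* Return a fence that signals no earlier than both a and b. */
    struct ordered_fence *(*order)(struct ordered_fence *a,
                                   struct ordered_fence *b);
    /* Make later work submitted on ring wait for f. */
    void (*sync_ring)(int ring, struct ordered_fence *f);
};

static int cpu_waits;   /* counts simulated blocking waits */

static void simple_wait(struct ordered_fence *f)
{
    if (!f->signaled) {
        f->signaled = true;   /* stand-in for blocking until f signals */
        cpu_waits++;
    }
}

/* Simple hardware: block on a, after which b alone is late enough. */
static struct ordered_fence *simple_order(struct ordered_fence *a,
                                          struct ordered_fence *b)
{
    simple_wait(a);
    return b;
}

/* Simple hardware: no semaphores, so the CPU waits before submitting. */
static void simple_sync_ring(int ring, struct ordered_fence *f)
{
    (void)ring;
    simple_wait(f);
}

static const struct fence_order_ops simple_ops = {
    .order = simple_order,
    .sync_ring = simple_sync_ring,
};

/* What common code might do before moving a buffer into reused space. */
static void sketch_prepare_move(const struct fence_order_ops *ops, int ring,
                                struct ordered_fence **fences, size_t n)
{
    struct ordered_fence *last;

    if (n == 0)
        return;
    last = fences[0];
    for (size_t i = 1; i < n; i++)
        last = ops->order(last, fences[i]);
    ops->sync_ring(ring, last);
}
```

A driver with hardware semaphores would supply its own order and sync_ring
that emit semaphore commands instead of blocking, and the common code would
stay the same.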

/Thomas

>
> Cheers,
> Jerome