Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

Thomas Hellstrom thomas at shipmail.org
Fri Nov 30 09:43:36 PST 2012


On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org>
>>> wrote:
>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>
>>>>> What I tried to point out was that the synchronization shouldn't be
>>>>> needed, because the CPU shouldn't do anything with the contents of
>>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>>> CPU do besides updating some kernel structures?
>>>>>
>>>>> Also, buffer deletion is something where you don't need to wait for
>>>>> the buffer to become idle if you know the memory area won't be
>>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>>> would be the GPU that moves new data in, and once that happens, the old
>>>>> buffer will be trivially idle, because single-ring GPUs execute
>>>>> commands in order.
>>>>>
>>>>> Marek
>>>>
>>>> Actually asynchronous eviction / deletion is something I have been
>>>> prototyping for a while but never gotten around to implementing in TTM:
>>>>
>>>> There are a few minor caveats:
>>>>
>>>> With buffer deletion, what you say is true for fixed memory, but not for TT
>>>> memory, where pages are reclaimed by the system after buffer destruction.
>>>> That means that we don't have to wait for idle to free GPU space, but we
>>>> need to wait before pages are handed back to the system.
>>>>
>>>> Swapout needs to access the contents of evicted buffers, but
>>>> synchronizing doesn't need to happen until just before swapout.
>>>>
>>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>>>> move in buffer contents to VRAM or a GPU aperture that was previously
>>>> evicted by another ring, it needs to sync with that eviction, but doesn't
>>>> know what buffer or even which buffers occupied the space previously.
>>>> Trivially one can attach a sync object to the memory type manager that
>>>> represents the last eviction from that memory type, and *any* engine (CPU
>>>> or GPU) that moves buffer contents in needs to order that movement with
>>>> respect to that fence. As you say, with a single ring and no CPU fallbacks,
>>>> that ordering is a no-op, but any common (non-driver based) implementation
>>>> needs to support this.
>>>>
>>>> A single fence attached to the memory type manager is the simplest solution,
>>>> but a solution with a fence for each free region in the free list is also
>>>> possible. Then TTM needs a driver callback to be able to order fences
>>>> w.r.t. each other.
>>>>
>>>> /Thomas
>>>>
>>> Radeon already handles multi-ring and TTM interaction with what we call
>>> semaphores. Semaphores are created to synchronize with fences across
>>> different rings. I think the easiest solution is to just remove the bo
>>> wait in TTM and let the driver handle this.
>>
>> The wait can be removed, but only conditioned on a driver flag that says it
>> supports unsynchronized buffer moves.
>>
>> The multi-ring case I'm talking about is:
>>
>> Ring 1 evicts buffer A, emits fence 0
>> Ring 2 evicts buffer B, emits fence 1
>> ..Other eviction takes place by various rings, perhaps including ring 1 and
>> ring 2.
>> Ring 3 moves buffer C into the space which happens to be the union of the
>> space previously occupied by buffer A and buffer B.
>>
>> Question is: which fence do you want to order this move with?
>> The answer is whichever of fence 0 and 1 signals last.
>>
>> I think it's a reasonable thing for TTM to keep track of this, but in order
>> to do so it needs a driver callback that can order two fences, and can order
>> a job in the current ring w.r.t. a fence. In radeon's case that driver
>> callback would probably insert a barrier / semaphore. In the case of simpler
>> hardware it would wait on one of the fences.
>>
>> /Thomas
>>
> I don't think we can order fences easily with a clean API. I would rather
> see TTM provide a list of fences to the driver and tell the driver that,
> before moving this object, all the fences on this list need to have
> completed. I think it's as easy as associating fences with drm_mm (well,
> nouveau has its own mm stuff), but the idea would basically be that fences
> are associated both with the bo and with the mm object, so you know when a
> segment of memory is idle/available for use.
>
> Cheers,
> Jerome


Hmm. Agreed that would save a lot of barriers.

Even if TTM tracks fences per free mm region or a single fence for the whole
memory type, it's a simple fact that fences from the same ring are trivially
ordered, which means such a list should contain at most as many fences as
there are rings.
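
To illustrate, a per-ring slot would keep the bookkeeping trivially bounded.
This is only a sketch: the names, the MAX_RINGS bound and the layout are made
up, and reference counting / locking are left out. struct fence here is the
proposed fence type from below.

/*
 * Sketch: outstanding eviction fences for one memory type, at most one
 * per ring. A newer fence from the same ring signals after the older
 * one, so it simply replaces it.
 */
#define MAX_RINGS 8				/* made-up upper bound */

struct evict_fence_list {
	struct fence *last_evict[MAX_RINGS];	/* indexed by ring id */
};

static void evict_list_add(struct evict_fence_list *list,
			   struct fence *fence, unsigned int ring)
{
	/* Same-ring fences are trivially ordered; keep only the newest. */
	list->last_evict[ring] = fence;
}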

So, whatever approach is chosen, TTM needs to be able to determine that
trivial ordering, and I think the upcoming cross-device fencing work will
face the exact same problem.

My proposed ordering API would look something like

struct fence *order_fences(struct fence *fence_a, struct fence *fence_b, 
bool trivial_order, bool interruptible, bool no_wait_gpu)

Returns whichever of the fences @fence_a and @fence_b that, when signaled,
guarantees that the other fence has also signaled. If @trivial_order is true
and the driver cannot trivially order the fences, it may return
ERR_PTR(-EAGAIN). If @interruptible is true, any wait should be performed
interruptibly, and if @no_wait_gpu is true, the function is not allowed to
wait on the GPU but should return ERR_PTR(-EBUSY) if it needs to do so to
order the fences.

(Hardware without semaphores can't order fences without waiting on them).
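
A rough sketch of what a driver-side implementation could look like for such
semaphore-less hardware. Illustration only: the ring / seq fields on struct
fence and the wait_for_fence() helper are made up.

#include <linux/err.h>

/* Assumes a fence exposes the emitting ring and a per-ring seqno. */
struct fence *order_fences(struct fence *fence_a, struct fence *fence_b,
			   bool trivial_order, bool interruptible,
			   bool no_wait_gpu)
{
	/* Same ring: fences signal in emission order (ignoring seq wrap). */
	if (fence_a->ring == fence_b->ring)
		return fence_a->seq < fence_b->seq ? fence_b : fence_a;

	/* Caller only wants trivial ordering; let it keep both fences. */
	if (trivial_order)
		return ERR_PTR(-EAGAIN);

	/* Without semaphores, ordering means waiting on one of the fences. */
	if (no_wait_gpu)
		return ERR_PTR(-EBUSY);

	if (wait_for_fence(fence_a, interruptible))
		return ERR_PTR(-ERESTARTSYS);

	/* fence_a has signaled, so fence_b alone now implies both. */
	return fence_b;
}

For hardware like radeon, the different-ring case would presumably insert a
semaphore / barrier there instead of waiting.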

The list approach you suggest would use @trivial_order = true; the
single-fence approach would use @trivial_order = false.

And a first simple implementation in TTM would perhaps use your list 
approach with a single list for the whole memory type.
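
The consumer side of that first implementation could then look roughly like
the following before new contents are moved in. Again made-up names; it reuses
the evict_fence_list sketch above, and sync_to() stands in for whatever the
driver does to order the move, a semaphore wait on radeon or a CPU wait on
simpler hardware.

/*
 * Sketch: order a move into this memory type against all outstanding
 * eviction fences. Fences from the move's own ring are already ordered
 * by in-order execution, so they can be skipped.
 */
static int sync_move_to_evictions(struct evict_fence_list *list,
				  unsigned int move_ring,
				  int (*sync_to)(struct fence *fence,
						 unsigned int ring))
{
	unsigned int i;
	int ret;

	for (i = 0; i < MAX_RINGS; i++) {
		struct fence *fence = list->last_evict[i];

		if (!fence || i == move_ring)
			continue;
		ret = sync_to(fence, move_ring);
		if (ret)
			return ret;
	}
	return 0;
}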

/Thomas




