Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

Thomas Hellstrom thomas at shipmail.org
Fri Nov 30 10:31:04 PST 2012


On 11/30/2012 07:07 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
>> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>>>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org>
>>>>> wrote:
>>>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>>> What I tried to point out was that the synchronization shouldn't be
>>>>>>> needed, because the CPU shouldn't do anything with the contents of
>>>>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>>>>> CPU do besides updating some kernel structures?
>>>>>>>
>>>>>>> Also, buffer deletion is something where you don't need to wait for
>>>>>>> the buffer to become idle if you know the memory area won't be
>>>>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>>>>> would be the GPU to move new data in and once that happens, the old
>>>>>>> buffer will be trivially idle, because single-ring GPUs execute
>>>>>>> commands in order.
>>>>>>>
>>>>>>> Marek
>>>>>> Actually asynchronous eviction / deletion is something I have been
>>>>>> prototyping for a while but never gotten around to implementing in TTM:
>>>>>>
>>>>>> There are a few minor caveats:
>>>>>>
>>>>>> With buffer deletion, what you say is true for fixed memory, but not for
>>>>>> TT
>>>>>> memory where pages are reclaimed by the system after buffer destruction.
>>>>>> That means that we don't have to wait for idle to free GPU space, but we
>>>>>> need to wait before pages are handed back to the system.
>>>>>>
>>>>>> Swapout needs to access the contents of evicted buffers, but
>>>>>> synchronizing
>>>>>> doesn't need to happen until just before swapout.
>>>>>>
>>>>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>>>>>> move in buffer contents to VRAM or a GPU aperture that was previously
>>>>>> evicted by another ring, it needs to sync with that eviction, but doesn't
>>>>>> know what buffer or even which buffers occupied the space previously.
>>>>>> Trivially one can attach a sync object to the memory type manager that
>>>>>> represents the last eviction from that memory type, and *any* engine (CPU
>>>>>> or
>>>>>> GPU) that moves buffer contents in needs to order that movement with
>>>>>> respect
>>>>>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>>>>>> ordering is a no-op, but any common (non-driver based) implementation
>>>>>> needs
>>>>>> to support this.
>>>>>>
>>>>>> A single fence attached to the memory type manager is the simplest
>>>>>> solution,
>>>>>> but a solution with a fence for each free region in the free list is also
>>>>>> possible. Then TTM needs a driver callback to be able to order fences
>>>>>> w.r.t. each other.
>>>>>>
>>>>>> /Thomas
>>>>>>
>>>>> Radeon already handles multi-ring and ttm interaction with what we call
>>>>> semaphores. Semaphores are created to synchronize with fences across
>>>>> different rings. I think the easiest solution is to just remove the bo
>>>>> wait in ttm and let the driver handle this.
>>>> The wait can be removed, but only conditioned on a driver flag that says it
>>>> supports asynchronous buffer moves.
>>>>
>>>> The multi-ring case I'm talking about is:
>>>>
>>>> Ring 1 evicts buffer A, emits fence 0
>>>> Ring 2 evicts buffer B, emits fence 1
>>>> ...Other evictions take place by various rings, perhaps including ring 1
>>>> and ring 2.
>>>> Ring 3 moves buffer C into the space which happens to be the union of the
>>>> space previously occupied by buffers A and B.
>>>>
>>>> Question is: which fence do you want to order this move with?
>>>> The answer is whichever of fence 0 and 1 signals last.
>>>>
>>>> I think it's a reasonable thing for TTM to keep track of this, but in
>>>> order to do so it needs a driver callback that can order two fences,
>>>> and can order a job in the current ring w.r.t. a fence. In radeon's
>>>> case that driver callback would probably insert a barrier / semaphore.
>>>> In the case of simpler hardware it would wait on one of the fences.
>>>>
>>>> /Thomas
>>>>
>>> I don't think we can order fences easily with a clean API; I would
>>> rather see ttm provide a list of fences to the driver and tell the
>>> driver that before moving this object, all the fences on the list need
>>> to be completed. I think it's as easy as associating fences with drm_mm
>>> (well, nouveau has its own mm stuff), but the idea would basically be
>>> that fences are associated both with the bo and with the mm object, so
>>> you know when a segment of memory is idle/available for use.
>>>
>>> Cheers,
>>> Jerome
>>
>> Hmm. Agreed that would save a lot of barriers.
>>
>> Even if TTM tracks fences by free mm regions or a single fence for
>> the whole memory type, it's a simple fact that fences from the same
>> ring are trivially ordered, which means such a list should contain at
>> most as many fences as there are rings.
> Yes, one function callback is needed to know which fence is necessary,
> and ttm needs to know the number of rings (note that I think newer
> hw will have something like 1024 rings or even more; even today's hw
> might have that many, since an nvidia channel is pretty much what I
> define to be a ring).
>
> But I think most cases will involve a few fences across a few rings.
> For example, one ring is the dma ring, then you have a ring for one of
> the GL contexts that is using the memory and another ring for the new
> context that wants to use the memory.
>
>> So, whatever approach is chosen, TTM needs to be able to determine
>> that trivial ordering, and I think the upcoming cross-device fencing
>> work will face the exact same problem.
>>
>> My proposed ordering API would look something like
>>
>> struct fence *order_fences(struct fence *fence_a, struct fence
>> *fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
>>
>> Returns whichever of the fences @fence_a and @fence_b that, when
>> signaled, guarantees that the other fence has also signaled. If
>> @trivial_order is true and the driver cannot trivially order the
>> fences, it may return ERR_PTR(-EAGAIN); if @interruptible is true,
>> any wait should be performed interruptibly; and if @no_wait_gpu is
>> true, the function is not allowed to wait on the gpu but should
>> return ERR_PTR(-EBUSY) if it would need to do so to order the fences.
>>
>> (Hardware without semaphores can't order fences without waiting on them).
>>
>> The list approach you suggest would use @trivial_order = true; the
>> single-fence approach would use @trivial_order = false.
>>
>> And a first simple implementation in TTM would perhaps use your list
>> approach with a single list for the whole memory type.
>>
>> /Thomas
> I would rather add a callback like :
>
> ttm_reduce_fences(unsigned *nfences, fence **fencearray)

I don't agree here. I think the fence-ordering function is more versatile
and a good abstraction that can be applied to this problem in a number of
cases. Anyway, we should sync this with Maarten and his fence work; the
same problem applies to attaching shared fences to a bo.
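To make the semantics concrete, here is a minimal, self-contained sketch of the order_fences() idea discussed above. This is not kernel code: "struct fence" is a stand-in with just a ring id and a per-ring sequence number, and NULL stands in for ERR_PTR(-EAGAIN). The key property it models is that fences on the same ring are trivially ordered, while fences on different rings are not.

```c
#include <stddef.h>

/*
 * Hypothetical model of the proposed order_fences() callback.
 * Two fences on the same ring are trivially ordered by seqno;
 * fences on different rings cannot be trivially ordered, so with
 * trivial_order set we return NULL (standing in for
 * ERR_PTR(-EAGAIN) in a real driver implementation).
 */
struct fence {
        int ring;               /* ring/engine the fence was emitted on */
        unsigned int seqno;     /* monotonically increasing per ring */
};

static struct fence *order_fences(struct fence *fence_a,
                                  struct fence *fence_b,
                                  int trivial_order)
{
        if (fence_a->ring == fence_b->ring) {
                /* Same ring: the fence with the later seqno signals last. */
                return (fence_a->seqno >= fence_b->seqno) ? fence_a : fence_b;
        }

        if (trivial_order)
                return NULL;    /* caller must keep both fences */

        /*
         * Hardware with semaphores could emit a barrier making one
         * fence wait on the other; hardware without would have to
         * CPU-wait on one of them. Neither is modeled here.
         */
        return NULL;
}
```

A caller moving a buffer into a region covered by both fences would keep only the returned fence and synchronize the move against it.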

>
> In the ttm bo move callback you provide the list of mm blocks (each
> having its array of fences) and the move callback is responsible for
> using whatever mechanism it wants to properly schedule and synchronize
> the move.

Agreed.
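As a rough illustration of the scheme Jerome describes, the data layout might look like the following. All names are illustrative stand-ins, not existing TTM API; the kernel's real list and bo types are replaced by minimal local declarations so the sketch is self-contained.

```c
#include <stddef.h>

/* Minimal stand-ins; in the kernel these come from <linux/list.h> and TTM. */
struct list_head { struct list_head *next, *prev; };
struct fence;
struct ttm_buffer_object;

/*
 * Each free mm block carries the fences that must signal before its
 * memory may be reused. The driver's move callback receives the list
 * of blocks covered by the destination region.
 */
struct mm_block_fences {
        struct list_head head;          /* link in the list handed to the move cb */
        unsigned long start, size;      /* region covered by this block */
        unsigned int nfences;           /* ideally at most one fence per ring */
        struct fence **fences;
};

/* Hypothetical driver move callback signature. */
typedef int (*ttm_move_cb)(struct ttm_buffer_object *bo,
                           struct list_head *blocks /* of mm_block_fences */);
```

The driver is then free to turn those fences into semaphores, barriers, or CPU waits as its hardware allows.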

>
> One thing I am not sure about is whether we should merge free mm blocks
> and merge/reduce their fence arrays, or provide a list of mm blocks to
> the move callback. I think there is a tradeoff here: you probably want
> to merge small mm blocks up to a certain point, but you don't want to
> merge so much that any allocation will have to wait on a zillion fences.

As mentioned previously, we can choose the complexity here, but the
simplest approach would be to have a single list for the whole manager.

I think if we don't merge mm blocks immediately when freed, we're going
to use up a lot of resources.
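The reduction step that bounds the fence count when blocks merge can be sketched as below. This is a hypothetical reading of the ttm_reduce_fences() callback Jerome proposed, using the trivial-ordering fact noted earlier in the thread: among fences on the same ring, only the one with the latest seqno needs to be kept, so the reduced array holds at most one fence per ring. "struct fence" is again a simplified stand-in.

```c
/*
 * Hypothetical sketch of ttm_reduce_fences(): when two free blocks
 * merge, their fence arrays are concatenated and then reduced in
 * place so at most one fence per ring remains. Fences on the same
 * ring are trivially ordered by seqno (the later one signaling
 * implies the earlier one has signaled).
 */
struct fence {
        int ring;
        unsigned int seqno;
};

static void ttm_reduce_fences(unsigned int *nfences, struct fence **fences)
{
        unsigned int n = *nfences, out = 0, i, j;

        for (i = 0; i < n; i++) {
                for (j = 0; j < out; j++) {
                        if (fences[j]->ring == fences[i]->ring) {
                                /* Same ring: keep only the later fence. */
                                if (fences[i]->seqno > fences[j]->seqno)
                                        fences[j] = fences[i];
                                break;
                        }
                }
                if (j == out)
                        fences[out++] = fences[i];      /* new ring seen */
        }
        *nfences = out;
}
```

With this bound in place, the worst-case wait for an allocation is one fence per ring rather than one per historical eviction.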


/Thomas




