Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

Thomas Hellstrom thomas at shipmail.org
Fri Nov 30 13:36:01 PST 2012


On 11/30/2012 10:07 PM, Jerome Glisse wrote:
> On Fri, Nov 30, 2012 at 09:35:49PM +0100, Thomas Hellstrom wrote:
>> On 11/30/2012 08:25 PM, Jerome Glisse wrote:
>>> On Fri, Nov 30, 2012 at 07:31:04PM +0100, Thomas Hellstrom wrote:
>>>> On 11/30/2012 07:07 PM, Jerome Glisse wrote:
>>>>> On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
>>>>>> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
>>>>>>> On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom <thomas at shipmail.org> wrote:
>>>>>>>> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>>>>>>>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom <thomas at shipmail.org>
>>>>>>>>> wrote:
>>>>>>>>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>>>>>>>>>> What I tried to point out was that the synchronization shouldn't be
>>>>>>>>>>> needed, because the CPU shouldn't do anything with the contents of
>>>>>>>>>>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>>>>>>>>>>> CPU do besides updating some kernel structures?
>>>>>>>>>>>
>>>>>>>>>>> Also, buffer deletion is something where you don't need to wait for
>>>>>>>>>>> the buffer to become idle if you know the memory area won't be
>>>>>>>>>>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>>>>>>>>>>> would be the GPU that moves the new data in, and once that happens,
>>>>>>>>>>> the old buffer will trivially be idle, because single-ring GPUs
>>>>>>>>>>> execute commands in order.
>>>>>>>>>>>
>>>>>>>>>>> Marek
>>>>>>>>>> Actually, asynchronous eviction / deletion is something I have been
>>>>>>>>>> prototyping for a while but never gotten around to implementing in TTM:
>>>>>>>>>>
>>>>>>>>>> There are a few minor caveats:
>>>>>>>>>>
>>>>>>>>>> With buffer deletion, what you say is true for fixed memory, but not
>>>>>>>>>> for TT memory, where pages are reclaimed by the system after buffer
>>>>>>>>>> destruction. That means that we don't have to wait for idle to free
>>>>>>>>>> GPU space, but we need to wait before pages are handed back to the
>>>>>>>>>> system.
>>>>>>>>>>
>>>>>>>>>> Swapout needs to access the contents of evicted buffers, but
>>>>>>>>>> synchronizing doesn't need to happen until just before swapout.
>>>>>>>>>>
>>>>>>>>>> Multi-ring - CPU support: If another ring / engine or the CPU is about
>>>>>>>>>> to move buffer contents into VRAM or a GPU aperture area that was
>>>>>>>>>> previously freed by an eviction on another ring, it needs to sync with
>>>>>>>>>> that eviction, but it doesn't know what buffer or even which buffers
>>>>>>>>>> occupied the space previously. Trivially, one can attach a sync object
>>>>>>>>>> to the memory type manager that represents the last eviction from that
>>>>>>>>>> memory type, and *any* engine (CPU or GPU) that moves buffer contents
>>>>>>>>>> in needs to order that movement with respect to that fence. As you say,
>>>>>>>>>> with a single ring and no CPU fallbacks, that ordering is a no-op, but
>>>>>>>>>> any common (non-driver-based) implementation needs to support this.
>>>>>>>>>>
>>>>>>>>>> A single fence attached to the memory type manager is the simplest
>>>>>>>>>> solution, but a solution with a fence for each free region in the free
>>>>>>>>>> list is also possible. Then TTM needs a driver callback to be able to
>>>>>>>>>> order fences with respect to each other.
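>>>>>>>>>>
>>>>>>>>>> A minimal sketch of the single-fence variant; all names here (xfence,
>>>>>>>>>> pick_later_fence, mem_type_sync) are made up for illustration and are
>>>>>>>>>> not existing TTM API:
>>>>>>>>>>
>>>>>>>>>>     /* Hypothetical fence handle (ring id + increasing seqno). */
>>>>>>>>>>     struct xfence;
>>>>>>>>>>
>>>>>>>>>>     /* Driver-provided ordering callback: returns the fence that, once
>>>>>>>>>>      * signaled, guarantees the other one has signaled as well. */
>>>>>>>>>>     struct xfence *pick_later_fence(struct xfence *a, struct xfence *b);
>>>>>>>>>>
>>>>>>>>>>     /* One per memory type (e.g. VRAM): fence of the last eviction. */
>>>>>>>>>>     struct mem_type_sync {
>>>>>>>>>>         struct xfence *last_evict;
>>>>>>>>>>     };
>>>>>>>>>>
>>>>>>>>>>     /* Record an eviction from this memory type. */
>>>>>>>>>>     static void note_eviction(struct mem_type_sync *m, struct xfence *f)
>>>>>>>>>>     {
>>>>>>>>>>         m->last_evict = m->last_evict ?
>>>>>>>>>>             pick_later_fence(m->last_evict, f) : f;
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     /* Any engine (CPU or GPU) that moves buffer contents into this
>>>>>>>>>>      * memory type must order that move after m->last_evict; with a
>>>>>>>>>>      * single ring and no CPU fallbacks that ordering is a no-op. */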
>>>>>>>>>>
>>>>>>>>>> /Thomas
>>>>>>>>>>
>>>>>>>>> Radeon already handles multi-ring and ttm interaction with what we call
>>>>>>>>> semaphores. Semaphores are created to synchronize with fences across
>>>>>>>>> different rings. I think the easiest solution is to just remove the bo
>>>>>>>>> wait in ttm and let the driver handle this.
>>>>>>>> The wait can be removed, but only conditioned on a driver flag that says
>>>>>>>> it supports unsynchronized buffer moves.
>>>>>>>>
>>>>>>>> The multi-ring case I'm talking about is:
>>>>>>>>
>>>>>>>> Ring 1 evicts buffer A, emits fence 0
>>>>>>>> Ring 2 evicts buffer B, emits fence 1
>>>>>>>> ...Other evictions take place on various rings, perhaps including ring 1
>>>>>>>> and ring 2.
>>>>>>>> Ring 3 moves buffer C into the space which happens to be the union of the
>>>>>>>> space previously occupied by buffer A and buffer B.
>>>>>>>>
>>>>>>>> Question is: which fence do you want to order this move with?
>>>>>>>> The answer is whichever of fence 0 and 1 signals last.
>>>>>>>>
>>>>>>>> I think it's a reasonable thing for TTM to keep track of this, but in
>>>>>>>> order to do so it needs a driver callback that can order two fences, and
>>>>>>>> can order a job in the current ring w.r.t. a fence. In radeon's case that
>>>>>>>> driver callback would probably insert a barrier / semaphore. In the case
>>>>>>>> of simpler hardware it would wait on one of the fences.
>>>>>>>>
>>>>>>>> /Thomas
>>>>>>>>
>>>>>>> I don't think we can order fences easily with a clean api; I would
>>>>>>> rather see ttm provide a list of fences to the driver and tell the
>>>>>>> driver that, before moving this object, all the fences on this list
>>>>>>> need to be completed. I think it's as easy as associating fences with
>>>>>>> drm_mm (well, nouveau has its own mm stuff), but the idea would
>>>>>>> basically be that fences are associated both with the bo and with the
>>>>>>> mm object, so you know when a segment of memory is idle/available for
>>>>>>> use.
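>>>>>>>
>>>>>>> Roughly this kind of association (made-up names, just to illustrate;
>>>>>>> not a real drm_mm extension):
>>>>>>>
>>>>>>>     #define MAX_RINGS 16        /* illustration only */
>>>>>>>
>>>>>>>     struct xfence;              /* hypothetical fence handle */
>>>>>>>
>>>>>>>     /* Fences hang both off the bo and off the mm node backing it;
>>>>>>>      * a memory segment is idle/reusable once every fence attached
>>>>>>>      * to its node has signaled (at most one per ring is needed). */
>>>>>>>     struct mm_node_fences {
>>>>>>>         struct xfence *fences[MAX_RINGS];
>>>>>>>         unsigned int num_fences;
>>>>>>>     };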
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jerome
>>>>>> Hmm. Agreed that would save a lot of barriers.
>>>>>>
>>>>>> Even if TTM tracks fences per free mm region, or keeps a single fence
>>>>>> for the whole memory type, it's a simple fact that fences from the same
>>>>>> ring are trivially ordered, which means such a list should contain at
>>>>>> most as many fences as there are rings.
>>>>> Yes, one callback function is needed to know which fence is necessary;
>>>>> also, ttm needs to know the number of rings (note that I think newer hw
>>>>> will have something like 1024 rings or even more, and even today's hw
>>>>> might have as many, as I think an nvidia channel is pretty much what I
>>>>> define to be a ring).
>>>>>
>>>>> But I think the common case will be a few fences across a few rings.
>>>>> Like, one ring is the dma ring, and then you have a ring for one of the
>>>>> GL contexts that is using the memory, and another ring for the new
>>>>> context that wants to use the memory.
>>>>>
>>>>>> So, whatever approach is chosen, TTM needs to be able to determine
>>>>>> that trivial ordering, and I think the upcoming cross-device fencing
>>>>>> work will face the exact same problem.
>>>>>>
>>>>>> My proposed ordering API would look something like
>>>>>>
>>>>>> struct fence *order_fences(struct fence *fence_a, struct fence
>>>>>> *fence_b, bool trivial_order, bool interruptible, bool no_wait_gpu)
>>>>>>
>>>>>> Returns whichever of the fences @fence_a and @fence_b that, when
>>>>>> signaled, guarantees that the other fence has signaled as well. If
>>>>>> @trivial_order is true and the driver cannot trivially order the
>>>>>> fences, it may return ERR_PTR(-EAGAIN); if @interruptible is true,
>>>>>> any wait should be performed interruptibly; and if @no_wait_gpu is
>>>>>> true, the function is not allowed to wait on the gpu, but should
>>>>>> return ERR_PTR(-EBUSY) if it would need to do so to order the fences.
>>>>>>
>>>>>> (Hardware without semaphores can't order fences without waiting on them).
>>>>>>
>>>>>> The list approach you suggest would use @trivial_order = true; the
>>>>>> single-fence approach would use @trivial_order = false.
>>>>>>
>>>>>> And a first simple implementation in TTM would perhaps use your list
>>>>>> approach with a single list for the whole memory type.
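>>>>>>
>>>>>> To make the semantics concrete, a rough sketch of how a driver without
>>>>>> semaphores might implement it; the xfence type and its ops are made up
>>>>>> for illustration (per-ring increasing sequence numbers assumed):
>>>>>>
>>>>>>     #include <linux/err.h>
>>>>>>     #include <linux/types.h>
>>>>>>
>>>>>>     /* Hypothetical fence: ring id, increasing per-ring seqno, ops. */
>>>>>>     struct xfence {
>>>>>>         unsigned int ring;
>>>>>>         unsigned long seqno;
>>>>>>         bool (*signaled)(struct xfence *f);
>>>>>>         int (*wait)(struct xfence *f, bool interruptible);
>>>>>>     };
>>>>>>
>>>>>>     static struct xfence *order_fences(struct xfence *a, struct xfence *b,
>>>>>>                                        bool trivial_order,
>>>>>>                                        bool interruptible, bool no_wait_gpu)
>>>>>>     {
>>>>>>         int ret;
>>>>>>
>>>>>>         if (a->ring == b->ring)          /* same ring: trivially ordered */
>>>>>>             return a->seqno > b->seqno ? a : b;
>>>>>>         if (a->signaled(a))              /* a already done, b orders both */
>>>>>>             return b;
>>>>>>         if (b->signaled(b))
>>>>>>             return a;
>>>>>>         if (trivial_order)               /* cannot order without waiting */
>>>>>>             return ERR_PTR(-EAGAIN);
>>>>>>         if (no_wait_gpu)                 /* ordering would mean a gpu wait */
>>>>>>             return ERR_PTR(-EBUSY);
>>>>>>         ret = a->wait(a, interruptible); /* wait a out, then b orders both */
>>>>>>         return ret ? ERR_PTR(ret) : b;
>>>>>>     }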
>>>>>>
>>>>>> /Thomas
>>>>> I would rather add a callback like:
>>>>>
>>>>> ttm_reduce_fences(unsigned *nfences, fence **fencearray)
>>>> I don't agree here. I think the fence order function is more versatile
>>>> and a good abstraction that can be applied to this problem in a number
>>>> of cases. Anyway, we should sync this with Maarten and his fence work.
>>>> The same problem applies to attaching shared fences to a bo.
>>> Radeon already handles the bo case on multi-ring, and I think it should
>>> be left to the driver to do what's necessary there.
>>>
>>> I'm not saying more versatility is bad here, but in fact I am pretty sure
>>> that no hw can give fence ordering, and I also think that if the driver
>>> has to track multi-ring fence ordering it will just waste resources for
>>> no good reason. For tracking that in radeon I would need to keep a list
>>> of semaphores and know which semaphores ensure synchronization between
>>> which rings, and, for each, which fences are concerned; just thinking
>>> about it, it would be a messy graph with tons of nodes.
>>>
>>> Of course, here I am thinking in terms of newer GPUs with tons of rings.
>>> GPUs with one ring, which are a vanishing category as newer opencl
>>> requires more rings, are easy to handle, but I really don't think we
>>> should design for those.
>> The biggest problem I see with ttm_reduce_fences() is that it seems to
>> have high complexity, since it doesn't know which fence is the new one.
>> And if it did, it would only be a multi-fence version of
>> order_fences(trivial=true).
> I am sure all drivers with multi-ring will store the ring id in their
> fence structure; that's, from my pov, a requirement. So when you get a
> list of fences, you first go over the fences on the same ring, and so far
> all fence implementations use an increasing sequence number per fence. So
> reducing becomes as easy as only leaving the most recent fence for each of
> the rings with an active fence. At the same time it could check (assuming
> it's a quick operation) whether a fence has already signaled or not.
>
> For radeon this function would be very small, a simple nested loop with a
> couple of tests in the loop.
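>
> Roughly like this, reusing the hypothetical xfence from the sketch further
> up in the thread (ring id, increasing per-ring seqno, a signaled() op);
> sketch only, not radeon code:
>
>     /* Drop signaled fences and any fence superseded by a newer fence on
>      * the same ring, compacting the array in place. */
>     static void ttm_reduce_fences(unsigned *nfences, struct xfence **fences)
>     {
>         unsigned i, j, out = 0;
>
>         for (i = 0; i < *nfences; i++) {
>             bool newest = true;
>
>             if (fences[i]->signaled(fences[i]))
>                 continue;                   /* already idle, drop it */
>
>             for (j = 0; j < *nfences; j++) {
>                 if (j != i &&
>                     fences[j]->ring == fences[i]->ring &&
>                     fences[j]->seqno > fences[i]->seqno) {
>                     newest = false;         /* newer fence on this ring */
>                     break;
>                 }
>             }
>             if (newest)
>                 fences[out++] = fences[i];
>         }
>         *nfences = out;
>     }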

What you describe sounds like an O(n²) algorithm, whereas an algorithm
based on order_fences is O(n).
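
For illustration, a sketch of the linear variant, using the same hypothetical
xfence and the order_fences() callback sketched further up in the thread (the
fixed ring count is an assumption for the example):

    #define NUM_RINGS 16                /* illustration only */

    /* Single pass: keep at most one fence per ring.  Same-ring fences are
     * trivially ordered, so order_fences() never has to wait here. */
    static void reduce_fences_linear(unsigned *nfences, struct xfence **fences)
    {
        struct xfence *newest[NUM_RINGS] = { NULL };
        unsigned i, out = 0;

        for (i = 0; i < *nfences; i++) {
            unsigned r = fences[i]->ring;   /* assumed < NUM_RINGS */

            newest[r] = newest[r] ?
                order_fences(newest[r], fences[i], true, false, true) :
                fences[i];
        }
        for (i = 0; i < NUM_RINGS; i++)
            if (newest[i] && !newest[i]->signaled(newest[i]))
                fences[out++] = newest[i];
        *nfences = out;
    }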

Thanks, Thomas