[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

Thomas Hellstrom thomas at shipmail.org
Thu Jan 21 04:59:26 PST 2010


Jerome Glisse wrote:
> On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
>   
>>> We had to do a similar thing in the
>>> Poulsbo driver and it turned out that we could save a significant amount of
>>> CPU by using a delayed workqueue, collecting objects and destroying them
>>> periodically.
>>>       
>> Yes, indeed, we don't really care about a fence expiring unless we
>> want to use that buffer or we are out of memory.
>> Processing each fence, especially if there is an interrupt per fence,
>> seems unnecessary, since you can just do that work in the cases
>> mentioned above.
>>
>> So, here is a new proposal that should have the best features of all
>> previously considered methods.
>> Assume first that there is a single FIFO channel.
>>
>> We introduce a list of fence nodes, each containing:
>> - A list of buffers NOT to be destroyed having the fence as their sync
>> object (or possibly one for each memory type)
>> - A delayed-destroy list of buffers having the fence as their sync object
>>
>> Fence nodes are added at the end of the list upon fence emission.
>> They are removed and freed when the buffer count drops to 0 (meaning
>> all buffers have been moved to later fences).
>> Thus, the number of fence nodes does not grow without bounds, but is
>> bounded by the number of buffers.
>>
>> The LRU lists are changed to contain only unfenced buffers, and buffers
>> are initially put on them.
>> When a buffer is fenced, it is removed from the LRU list or its
>> current fence list, and added to the new fence (possibly destroying
>> the old fence node in the process).
>> The reference count of buffer objects is only increased when they are
>> put in a delayed destruction list.
>>
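A minimal C sketch of the bookkeeping described above; every name here
(fence_node, sketch_bo and their fields) is hypothetical and is not the
actual TTM code:

struct sketch_bo;

struct fence_node {
        struct fence_node *next;       /* fence list, in emission order */
        struct sketch_bo  *alive;      /* referenced buffers fenced by this fence */
        struct sketch_bo  *delayed;    /* unreferenced buffers to destroy on signal */
        unsigned int       nbuffers;   /* node is freed when this drops to 0 */
};

struct sketch_bo {
        struct sketch_bo  *next;       /* link on the LRU, alive or delayed list */
        struct fence_node *fence;      /* NULL while the buffer is on the LRU */
        int                refcount;   /* raised only when put on a delayed list */
};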
>> When a buffer is deleted, it is destroyed immediately if it is on the
>> LRU list, and moved to the corresponding delayed-destroy list if it is
>> on a fence list.
>>
>> ttm_bo_delayed_delete is no longer called by any workqueue, but only
>> when we run out of memory, before we try eviction.
>> It processes the list up to the first unsignalled fence, destroying the
>> buffers on the delayed-destroy lists and freeing the fence nodes as it
>> goes.
>> All the buffers in the alive lists are put back in the LRU list, in the
>> correct order.
>> We may either keep an alive list per memory type, or sort the buffers
>> by memory type (so they are put in the memory type's LRU list) at this
>> point.
>>
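A rough sketch of that on-demand processing step, building on the
structures above; fence_signalled(), bo_destroy() and lru_add_tail() are
assumed helpers for illustration, not real TTM functions:

#include <stdbool.h>
#include <stdlib.h>

bool fence_signalled(struct fence_node *f);              /* assumed helper */
void bo_destroy(struct sketch_bo *bo);                   /* assumed helper */
void lru_add_tail(struct sketch_bo **lru, struct sketch_bo *bo);

static void sketch_delayed_delete(struct fence_node **fences,
                                  struct sketch_bo **lru)
{
        struct fence_node *f = *fences;

        /* Walk fences in emission order, stop at the first unsignalled one. */
        while (f && fence_signalled(f)) {
                struct sketch_bo *bo, *next;

                /* Unreferenced buffers: destroy immediately. */
                for (bo = f->delayed; bo; bo = next) {
                        next = bo->next;
                        bo_destroy(bo);
                }
                /* Referenced buffers: back on the LRU, oldest fence first. */
                for (bo = f->alive; bo; bo = next) {
                        next = bo->next;
                        bo->fence = NULL;
                        lru_add_tail(lru, bo);
                }
                *fences = f->next;
                free(f);
                f = *fences;
        }
}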
>> Thus, after ttm_bo_delayed_delete, there is the following scenario:
>> 1. Inactive buffers with no references are destroyed
>> 2. Inactive buffers with references are in the LRU list
>> 3. Fenced buffers with no references are in the delayed-destroy list
>> attached to their fence
>> 4. Fenced buffers with references are in the alive list attached to their fence
>>
>> This should have about the same CPU and memory usage as the current
>> code, but with the following advantages:
>> 1. Delayed deletion does not run periodically, but exactly when needed
>> (at allocation time, before eviction)
>> 2. Delayed deletion always deletes all buffers that can be deleted,
>> since it processes them in fence order
>> 3. The LRU list at eviction time contains exactly all evictable buffers
>> 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
>> only one node is necessary
>> 5. Once buffer deletion determines a fence is signalled, it can
>> destroy all its to-be-destroyed buffers without any further checking
>> 6. Some fence logic (such as signalling all fences on forced channel
>> closing) can be moved from drivers to TTM
>> 7. Some channel logic can be moved from drivers to TTM
>> 8. The code should be simpler and more elegant
>>
>> Note that using a buffer with the CPU doesn't change its position in
>> the LRU list.
>> This is fine, because evicting a buffer that is being used by the CPU
>> is actually desirable: it will move to a region of memory that is
>> slower for the GPU but possibly faster for the CPU.
>> However, for buffers in system memory, if the LRU list is used for
>> swapping, CPU access should move the buffer to the front of the list
>> (using the LRU list for this is probably a bad idea though, since
>> system memory swapping should be done at page granularity).
>>
>> For multiple channels, things are slightly more complex.
>> First, we need to build the fence data structure for each channel.
>> Also, we need the fence/buffer nodes to be separate, since a buffer may
>> be fenced on several channels at once.
>> Delayed destruction continues to work, but on destruction the reference
>> count needs to be set to the number of channels fencing the buffer.
>> Putting buffers in the LRU list can be done by doing n-way merging
>> between the channel fence lists, assigning each fence a global
>> sequence number, and using that to merge and put back buffers in the
>> LRU.
>> n-way merging can be done with a small heap on the stack, since the
>> number of channels is limited on current hardware.
>> Furthermore, we need to keep a reference count, so that buffers are
>> put in the LRU (or destroyed) only after they are off all the
>> channels.
>>
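One way the per-channel merge might look; the sketch below scans the
channel heads linearly rather than using the small on-stack heap
mentioned above, and every name in it is hypothetical:

#define NCHAN 8                        /* assumed channel count */

struct mc_bo {
        int channels;                  /* channels still fencing this buffer */
        /* placement, LRU link, refcount, ... */
};

struct bo_link {                       /* separate fence/buffer link node */
        struct bo_link *next;
        struct mc_bo   *bo;
};

struct chan_fence {
        struct chan_fence *next;       /* per-channel list, emission order */
        unsigned long      global_seq; /* global sequence number at emission */
        struct bo_link    *alive;      /* referenced buffers on this fence */
};

void lru_put_back(struct mc_bo *bo);   /* assumed helper, not a TTM call */

static void sketch_merge_channels(struct chan_fence *heads[NCHAN])
{
        for (;;) {
                int best = -1;

                /* Pick the channel whose head fence was emitted first. */
                for (int i = 0; i < NCHAN; i++)
                        if (heads[i] && (best < 0 ||
                            heads[i]->global_seq < heads[best]->global_seq))
                                best = i;
                if (best < 0)
                        break;                          /* all lists drained */

                /* A buffer goes back on the LRU only once it is off
                 * every channel that was fencing it. */
                for (struct bo_link *l = heads[best]->alive; l; l = l->next)
                        if (--l->bo->channels == 0)
                                lru_put_back(l->bo);

                heads[best] = heads[best]->next;
        }
}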
>> The LRU list semantics are relaxed. If a buffer's fence is both
>> emitted before another buffer's fence and signalled before it, the
>> buffer is considered "less recently" used; the opposite holds if both
>> events come after. Otherwise, the order is undetermined.
>>
>> This should still guarantee good eviction behavior, and it allows the
>> LRU list to contain only evictable buffers without requiring a dynamic
>> priority queue.
>> Alternatively, a red-black tree (or an indirect heap with back
>> references) indexed by the global fence sequence number could
>> guarantee strict emission-time LRU semantics. This seems unnecessary
>> though.
>>
>> Thoughts? Opinions?
>> Any possible improvement?
>>
>>     
>
> I love the first solution, but the solution on the multi-FIFO side
> looks like overkill. So I am wondering if we aren't tackling the issue
> from the wrong side. Below is a slightly different approach.
>
> The idea is that memory management is about having BOs in place for
> the GPU. So instead of looking at the problem from the BO side, look
> at it from the GPU address space side.
>
> For simplicity I am only using one space here, VRAM (but GTT and
> others can be dealt with in the same way). We then look at the problem
> as a transaction: any time the driver wants to do something, it asks
> TTM to create a new transaction, and each transaction is associated
> with a fence and a list of buffers with their possible placements
> (I will explain how this can fit multi-FIFO GPUs well too, and how it
> also allows easy support for GPUs with virtual address spaces, which
> is something that is slowly becoming a reality, I think).
>
> When TTM receives a transaction, it takes a transaction lock and gives
> a transaction number to the new transaction. Then it goes over the BO
> list and tries to come up with the best placement for each buffer.
>   
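For concreteness, a transaction as Jerome describes it might carry
roughly this information; all of these names are hypothetical, not
existing TTM or driver structures:

struct placement_req {
        void         *bo;              /* the buffer object to place */
        unsigned int  allowed;         /* bitmask of acceptable domains (VRAM, GTT, ...) */
};

struct ttm_sketch_transaction {
        unsigned long         seq;     /* assigned under the transaction lock */
        void                 *fence;   /* fence associated with this submission */
        struct placement_req *reqs;    /* buffers and their possible placements */
        unsigned int          nreqs;
};

TTM would assign seq under the transaction lock and then pick a
placement for each entry in reqs.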
Jerome, this seems to have a fundamental flaw: it does not allow
simultaneous command submission.

With the current code, a process blocked trying to validate its buffers
will not block other processes from validating, and we should keep it
that way.

/Thomas

