[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

Thomas Hellstrom thomas at shipmail.org
Thu Jan 21 04:53:28 PST 2010


Luca Barbieri wrote:
>> We had to do a similar thing in the
>> Poulsbo driver and it turned out that we could save a significant amount of
>> CPU by using a delayed workqueue, collecting objects and destroying them
>> periodically.
>>     
> Yes, indeed, we don't really care about a fence expiring unless we
> want to use that buffer or we are out of memory.
> Processing each fence, especially if there is an interrupt per fence,
> seems unnecessary, since you can just do that work in the cases
> mentioned above.
>
> So, here is a new proposal that should have the best features of all
> previously considered methods.
> Assume first that there is a single FIFO channel.
>
> We introduce a list of fence nodes, each containing:
> - A list of buffers NOT to be destroyed that have the fence as their
> sync object (or possibly one such list per memory type)
> - A delayed-destroy list of buffers having the fence as their sync object
>
> Fence nodes are added at the end upon emission.
> They are removed and freed when the buffer count drops to 0 (meaning
> all buffers have been moved to later fences).
> Thus, the number of fence nodes does not grow without bounds, but is
> bounded by the number of buffers.
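>
> As a rough sketch (the names here are only illustrative), such a fence
> node could look like:
>
> struct fence_node {
> 	struct list_head head;      /* entry in the channel's fence list */
> 	void *sync_obj;             /* the driver fence itself */
> 	struct list_head alive;     /* fenced buffers NOT to be destroyed
> 	                             * (possibly one list per memory type) */
> 	struct list_head ddestroy;  /* buffers to destroy once signaled */
> 	unsigned num_bo;            /* buffers on both lists; the node is
> 	                             * freed when this drops to 0 */
> };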
> The LRU lists are changed to contain only unfenced buffers, and buffers
> are initially put there.
> When a buffer is fenced, it is removed from the LRU list or its
> current fence list, and added to the new fence (possibly destroying
> the old fence node in the process).
> The reference count of buffer objects is only increased when they are
> put in a delayed destruction list.
>
> Upon deletion, they are destroyed immediately if they are on the LRU
> list and moved to the corresponding delayed-delete list if they are in
> the fenced list.
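>
> In pseudo-C the two paths could look roughly like this (locking is
> omitted, and bo->fence_node, fence_node_free() and bo_destroy() are
> made-up names):
>
> static void bo_fence(struct ttm_buffer_object *bo, struct fence_node *fence)
> {
> 	struct fence_node *old = bo->fence_node;
>
> 	/* leave the LRU list or the old fence's alive list */
> 	list_del(&bo->lru);
> 	if (old && --old->num_bo == 0)
> 		fence_node_free(old);	/* all buffers moved to later fences */
>
> 	list_add_tail(&bo->lru, &fence->alive);
> 	fence->num_bo++;
> 	bo->fence_node = fence;
> }
>
> static void bo_release(struct ttm_buffer_object *bo)
> {
> 	if (!bo->fence_node) {
> 		/* unfenced: destroy immediately */
> 		list_del(&bo->lru);
> 		bo_destroy(bo);
> 	} else {
> 		/* fenced: take the extra reference and queue for delayed
> 		 * destruction on its fence */
> 		kref_get(&bo->list_kref);
> 		list_move_tail(&bo->lru, &bo->fence_node->ddestroy);
> 	}
> }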
>
> ttm_bo_delayed_delete is no longer called by any workqueue, but only
> when we run out of memory, before we try eviction.
> It processes the list until the first unsignalled fence, destroying
> all buffers it finds and freeing all the fence nodes.
> All the buffers in the alive lists are put in the LRU list, in the
> correct order.
> We may either keep an alive list per memory type, or sort the buffers
> by memory type (so they are put in the memory type's LRU list) at this
> point.
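>
> Roughly (again just a sketch; bdev->fence_list, bdev->lru and
> fence_node_free() are made-up names, and locking is left out):
>
> static void ttm_bo_delayed_delete(struct ttm_bo_device *bdev)
> {
> 	struct fence_node *fence, *fnext;
> 	struct ttm_buffer_object *bo, *bnext;
>
> 	list_for_each_entry_safe(fence, fnext, &bdev->fence_list, head) {
> 		/* stop at the first fence that has not signaled yet */
> 		if (!bdev->driver->sync_obj_signaled(fence->sync_obj, NULL))
> 			break;
>
> 		/* buffers that were only waiting for this fence die now */
> 		list_for_each_entry_safe(bo, bnext, &fence->ddestroy, lru) {
> 			list_del_init(&bo->lru);
> 			kref_put(&bo->list_kref, ttm_bo_release_list);
> 		}
>
> 		/* surviving buffers are idle again: back on the LRU,
> 		 * already in emission order */
> 		list_splice_tail(&fence->alive, &bdev->lru);
>
> 		list_del(&fence->head);
> 		fence_node_free(fence);
> 	}
> }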
>
> Thus, after ttm_bo_delayed_delete, there is the following scenario:
> 1. Inactive buffers with no references are destroyed
> 2. Inactive buffers with references are in the LRU list
> 3. Fenced buffers with no references are in the delayed-destroy list
> attached to their fence
> 4. Fenced buffers with references are in the alive list attached to their fence
>
> This should have about the same CPU and memory usage as the current
> code, but with the following advantages:
> 1. Delayed deletion does not run periodically, but exactly when needed
> (at allocation time, before eviction)
> 2. Delayed deletion always deletes all buffers that can be deleted,
> since it processes them in fence order
> 3. The LRU list at eviction time contains exactly all evictable buffers
> 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
> only one node is necessary
> 5. Once buffer deletion determines a fence is signalled, it can
> destroy all of its to-be-destroyed buffers without any further checking
> 6. Some fence logic (such as signalling of all fences on forced channel
> closing) can be moved from drivers to TTM
> 7. Some channel logic can be moved from drivers to TTM
> 8. The code should be simpler and more elegant
>
> Note that using a buffer with the CPU doesn't change its position in
> the LRU list.
> This is good, because evicting a buffer used by the CPU is actually a
> good thing, since it will move to a region of memory slower for the
> GPU but possibly faster for the CPU.
> However, for buffers in system memory, if the LRU list is used for
> swapping, CPU access should move the buffer to the front of the list
> (using the LRU list for this is probably a bad idea though, since
> system memory swapping should happen at page granularity).
>
> For multiple channels, things are slightly more complex.
> First, we need to build the fence data structure for each channel.
> Also, we need the fence/buffer nodes to be separate, since a buffer can
> now be fenced on more than one channel at once.
> Delayed destruction continues to work, but the reference count needs
> to be set to the number of channels fencing the buffer on destruction.
> Putting buffers in the LRU list can be done by doing n-way merging
> between the channel fence lists, assigning each fence a global
> sequence number, and using that to merge and put back buffers in the
> LRU.
> n-way merging can be done with a small heap on the stack, since the
> number of channels is limited on current hardware.
> Furthermore, we need to keep a reference count, so that buffers are
> put in the LRU (or destroyed) only after they are off all the
> channels.
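>
> For example (again a sketch only: struct merge_cursor, global_seq and
> requeue_buffers() are made up, and the heap is just the textbook
> array-based one):
>
> struct merge_cursor {
> 	struct fence_node *pos;		/* next signaled fence to take */
> 	struct fence_node *last;	/* last signaled fence on the channel */
> };
>
> static bool cursor_lt(const struct merge_cursor *a,
> 		      const struct merge_cursor *b)
> {
> 	return a->pos->global_seq < b->pos->global_seq;
> }
>
> static void sift_down(struct merge_cursor **heap, int n, int i)
> {
> 	for (;;) {
> 		int l = 2 * i + 1, r = l + 1, min = i;
>
> 		if (l < n && cursor_lt(heap[l], heap[min]))
> 			min = l;
> 		if (r < n && cursor_lt(heap[r], heap[min]))
> 			min = r;
> 		if (min == i)
> 			break;
> 		swap(heap[i], heap[min]);
> 		i = min;
> 	}
> }
>
> /* Walk the signaled fences of all channels in global emission order. */
> static void merge_channels(struct merge_cursor **heap, int n,
> 			   struct list_head *lru)
> {
> 	int i;
>
> 	for (i = n / 2 - 1; i >= 0; i--)	/* heapify */
> 		sift_down(heap, n, i);
>
> 	while (n > 0) {
> 		struct fence_node *fence = heap[0]->pos;
>
> 		/* buffers whose per-buffer channel count drops to zero
> 		 * here go back on the LRU (or are finally destroyed) */
> 		requeue_buffers(fence, lru);
>
> 		if (fence == heap[0]->last)
> 			heap[0] = heap[--n];	/* channel exhausted */
> 		else
> 			heap[0]->pos = list_next_entry(fence, head);
> 		sift_down(heap, n, 0);
> 	}
> }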
>
> The LRU list semantics are relaxed. If a buffer's fence is both
> emitted before and signaled before another buffer's fence, it is
> considered "less recently" used; the opposite holds if both come after.
> Otherwise, the order is undetermined.
>
> This should still guarantee good eviction behavior, and allows the LRU
> list to contain only evictable buffers without requiring a dynamic
> priority queue.
> Alternatively, a red-black tree (or an indirect heap with back
> references) indexed by the global fence sequence number could guarantee
> strict emission-time LRU semantics. This seems unnecessary though.
>
> Thoughts? Opinions?
> Any possible improvement?
>   
Hi, Luca.

I have some comments, but I'm heading off for vacation and will be back in a week.

At a first glance:

1) We probably *will* need a delayed-destroy workqueue to avoid 
wasting memory that should otherwise be freed to the system. At the very 
least, the delayed delete process should optionally be run by a system 
shrinker.
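
Roughly what I have in mind, only as an illustration (the exact
shrinker interface depends on the kernel version, and
ttm_ddestroy_count / ttm_bo_delayed_delete_some are made-up names):

static unsigned long ttm_bo_shrink_count(struct shrinker *shrink,
					 struct shrink_control *sc)
{
	/* number of buffers currently sitting on delayed-destroy lists */
	return atomic_long_read(&ttm_ddestroy_count);
}

static unsigned long ttm_bo_shrink_scan(struct shrinker *shrink,
					struct shrink_control *sc)
{
	/* walk the fence lists and free what has already signaled */
	return ttm_bo_delayed_delete_some(sc->nr_to_scan);
}

/* registered with register_shrinker() at init time */
static struct shrinker ttm_bo_shrinker = {
	.count_objects	= ttm_bo_shrink_count,
	.scan_objects	= ttm_bo_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};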

2) Fences in TTM are currently not necessarily strictly ordered, and 
sequence numbers are hidden from the bo code. This means, for a given 
FIFO, fence sequence 3 may expire before fence sequence 2, depending on 
the usage of the buffer. The old Poulsbo and the experimental openChrome 
drivers use this feature, as does the long-deprecated Intel TTM 
implementation. The reason for this is that the same FIFO may hand 
commands to different HW engines, and the engines execute in parallel. 
With the asynchronous memory management I've discussed earlier, there 
will be a driver callback to order two fences from the same FIFO, but 
that callback will have to either idle one of the engines, flush the GPU 
caches or insert some kind of syncing primitive into the command stream. 
If we need to order all fences, we will do unnecessary idling / flushing 
/ syncing.
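
Something like the following, purely as a hypothetical signature (TTM
only sees opaque sync_obj pointers, so the callback would live in
ttm_bo_driver next to the other sync_obj hooks):

	/*
	 * Make sure sync_obj_a signals no later than sync_obj_b,
	 * idling an engine, flushing caches or emitting a syncing
	 * primitive as needed.  Drivers whose fences on a FIFO are
	 * already strictly ordered can make this a no-op.
	 */
	int (*sync_obj_order)(void *sync_obj_a, void *sync_obj_b);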

I'll comment a bit more when I get back.

/Thomas


