[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

Luca Barbieri luca at luca-barbieri.com
Wed Jan 20 19:49:39 PST 2010


> We had to do a similar thing in the
> Poulsbo driver and it turned out that we could save a significant amount of
> CPU by using a delayed workqueue, collecting objects and destroying them
> periodically.
Yes, indeed: we don't really care about a fence expiring unless we
want to use that buffer or we are out of memory.
Processing each fence, especially if there is an interrupt per fence,
seems unnecessary, since you can just do that work in the two cases
mentioned above.

So, here is a new proposal that should have the best features of all
previously considered methods.
Assume first that there is a single FIFO channel.

We introduce a list of fence nodes, each containing:
- A list of buffers that have the fence as their sync object but are
NOT to be destroyed (possibly one such list per memory type)
- A delayed-destroy list of buffers that have the fence as their sync
object
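
To make this concrete, here is a rough sketch of the data structures
in kernel C. All names are hypothetical, not existing TTM symbols, and
locking is ignored throughout:

#include <linux/list.h>

struct ttm_fence_node {
        struct list_head channel_entry; /* link in the channel's
                                           emission-ordered fence list */
        void *sync_obj;                 /* the driver fence itself */
        struct list_head alive;         /* buffers with this fence as
                                           sync object, NOT to be
                                           destroyed (possibly one list
                                           per memory type) */
        struct list_head destroy;       /* delayed-destroy buffers */
        unsigned long num_bos;          /* node is freed when this
                                           drops to 0 */
};

/* struct ttm_buffer_object is assumed to gain two fields:
 *   struct list_head list_entry;        on exactly one LRU, alive or
 *                                       destroy list at a time
 *   struct ttm_fence_node *fence_node;  NULL while unfenced
 */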

Fence nodes are appended to the list upon fence emission.
A node is removed and freed when its buffer count drops to 0, meaning
all of its buffers have been moved to later fences.
Thus the number of fence nodes does not grow without bound: it is
bounded by the number of buffers.
The LRU lists are changed to contain only unfenced buffers, and
buffers are initially placed on them.
When a buffer is fenced, it is removed from the LRU list or from its
current fence list, and added to the new fence's list (possibly
destroying the old fence node in the process).
The reference count of a buffer object is only increased when it is
put on a delayed-destroy list.
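
A rough sketch of this fencing transition, reusing the hypothetical
fields above (ttm_bo_attach_fence is a made-up name):

static void ttm_bo_attach_fence(struct ttm_buffer_object *bo,
                                struct ttm_fence_node *fence)
{
        /* Works whether the buffer currently sits on an LRU list or
         * on an old fence node's alive list. */
        list_del(&bo->list_entry);
        if (bo->fence_node && --bo->fence_node->num_bos == 0) {
                /* All buffers have moved on to later fences, so the
                 * old node can go away. */
                list_del(&bo->fence_node->channel_entry);
                kfree(bo->fence_node);
        }
        list_add_tail(&bo->list_entry, &fence->alive);
        fence->num_bos++;
        bo->fence_node = fence;
}

Note that the buffer's reference count is untouched here, matching the
rule above.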

Upon deletion, a buffer is destroyed immediately if it is on an LRU
list, and moved to the corresponding delayed-destroy list if it is on
a fence list.
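
The release path might then look like this (ttm_bo_destroy stands in
for whatever actually frees the buffer):

static void ttm_bo_release(struct ttm_buffer_object *bo)
{
        if (!bo->fence_node) {
                /* Unfenced: the buffer is on an LRU list and nothing
                 * is still using it, so it can die immediately. */
                list_del(&bo->list_entry);
                ttm_bo_destroy(bo);
                return;
        }
        /* Fenced: keep the buffer around until its fence signals.
         * The same list node is simply relinked from the alive list
         * to the delayed-destroy list, so no extra allocation is
         * needed. */
        list_move_tail(&bo->list_entry, &bo->fence_node->destroy);
}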

ttm_bo_delayed_delete is no longer called from a workqueue, but only
when we run out of memory, before we try eviction.
It processes the list up to the first unsignalled fence, destroying
all the to-be-destroyed buffers it finds and freeing their fence
nodes.
All the buffers on the alive lists are put back on the LRU list, in
the correct order.
We may either keep an alive list per memory type, or sort the buffers
by memory type at this point (so that each goes on its memory type's
LRU list).
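
Put together, the reworked ttm_bo_delayed_delete could look roughly
like this; struct ttm_channel, driver_fence_signalled() and
ttm_lru_for() are placeholders for whatever the driver and TTM
actually provide:

static void ttm_bo_delayed_delete(struct ttm_channel *channel)
{
        struct ttm_fence_node *fence, *next_fence;
        struct ttm_buffer_object *bo, *next_bo;

        list_for_each_entry_safe(fence, next_fence, &channel->fences,
                                 channel_entry) {
                if (!driver_fence_signalled(fence->sync_obj))
                        break;  /* all later fences are still busy */

                /* The fence has signalled, so every buffer on it can
                 * be handled without further checks. */
                list_for_each_entry_safe(bo, next_bo, &fence->destroy,
                                         list_entry) {
                        list_del(&bo->list_entry);
                        ttm_bo_destroy(bo);
                }

                /* Alive buffers become ordinary LRU entries again,
                 * in fence (i.e. last-use) order. */
                list_for_each_entry_safe(bo, next_bo, &fence->alive,
                                         list_entry) {
                        bo->fence_node = NULL;
                        list_move_tail(&bo->list_entry,
                                       ttm_lru_for(bo));
                }

                list_del(&fence->channel_entry);
                kfree(fence);
        }
}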

Thus, after ttm_bo_delayed_delete, the following invariants hold:
1. Inactive buffers with no references are destroyed
2. Inactive buffers with references are in the LRU list
3. Fenced buffers with no references are in the delayed-destroy list
attached to their fence
4. Fenced buffers with references are in the alive list attached to their fence

This should have about the same CPU and memory usage as the current
code, with the following advantages:
1. Delayed deletion does not run periodically, but exactly when needed
(at allocation time, before eviction)
2. Delayed deletion always deletes all buffers that can be deleted,
since it processes them in fence order
3. The LRU list at eviction time contains exactly all evictable buffers
4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
only one node is necessary
5. Once buffer deletion determines a fence is signalled, it can
destroy all of its to-be-destroyed buffers without any further checks
6. Some fence logic (such as signalling all fences on forced channel
closing) can be moved from drivers to TTM
7. Some channel logic can be moved from drivers to TTM
8. The code should be simpler and more elegant

Note that using a buffer with the CPU doesn't change its position on
the LRU list.
This is fine, because evicting a buffer that is being used by the CPU
is actually a good thing: it will move to a region of memory that is
slower for the GPU but possibly faster for the CPU.
However, for buffers in system memory, if the LRU list were used for
swapping, CPU access should move the buffer to the front of the list
(using the LRU list for this is probably a bad idea though, since
system memory swapping should be done at page granularity).

For multiple channels, things are slightly more complex.
First, we need to build the fence data structure for each channel.
Also, the fence/buffer list nodes need to be separate from the
buffers themselves, since a buffer may be fenced on several channels
at once.
Delayed destruction continues to work, but on destruction the
reference count needs to be set to the number of channels fencing the
buffer.
Putting buffers back on the LRU list can be done by an n-way merge of
the channel fence lists: assign each fence a global sequence number
at emission and use it to merge the lists, putting the buffers back
on the LRU in that order.
The n-way merge can be done with a small heap on the stack, since
channels are limited in number on current hardware (see the sketch
below).
Furthermore, we need to keep a per-buffer count, so that a buffer is
put on the LRU (or destroyed) only after it is off all the channels.
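
A sketch of that merge step. Each fence node is assumed to also carry
a hypothetical global_seq assigned at emission; a linear scan over the
channel heads stands in for the small heap, and ttm_fence_node_retire()
stands for the per-fence processing shown in ttm_bo_delayed_delete
above, extended to drop the per-channel count:

static void ttm_restore_lru_order(struct list_head *channel_fences,
                                  int nchannels)
{
        for (;;) {
                struct ttm_fence_node *oldest = NULL, *f;
                int i;

                /* Pick the signalled fence with the lowest global
                 * sequence number across all channel heads. */
                for (i = 0; i < nchannels; i++) {
                        if (list_empty(&channel_fences[i]))
                                continue;
                        f = list_first_entry(&channel_fences[i],
                                             struct ttm_fence_node,
                                             channel_entry);
                        if (!driver_fence_signalled(f->sync_obj))
                                continue;
                        if (!oldest ||
                            f->global_seq < oldest->global_seq)
                                oldest = f;
                }
                if (!oldest)
                        break;

                /* Buffers reach the LRU (or destruction) only once
                 * they are off every channel fencing them. */
                ttm_fence_node_retire(oldest);
        }
}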

The LRU list semantics are relaxed: if one buffer's fence is both
emitted and signalled before another's, that buffer is considered
less recently used; the converse holds if both events come after.
Otherwise, the order is undetermined.
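
The relaxed ordering can be stated as a predicate; emit_seq and
signal_seq are hypothetical counters recorded when a buffer's fence
is emitted and observed signalled:

struct ttm_bo_use {
        unsigned long emit_seq;   /* when the buffer's fence was
                                     emitted */
        unsigned long signal_seq; /* when it was observed signalled */
};

/* True only if a is definitely less recently used than b; when
 * neither this nor the converse holds, either LRU placement is
 * acceptable. */
static bool bo_less_recently_used(const struct ttm_bo_use *a,
                                  const struct ttm_bo_use *b)
{
        return a->emit_seq < b->emit_seq &&
               a->signal_seq < b->signal_seq;
}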

This should still give good eviction behavior, and it allows the LRU
list to contain only evictable buffers while not requiring a dynamic
priority queue.
Alternatively, a red-black tree (or an indirect heap with back
references) indexed by the global fence sequence number could
guarantee strict emission-time LRU semantics. That seems unnecessary,
though.

Thoughts? Opinions?
Any possible improvement?

