[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
glisse at freedesktop.org
Mon Jan 25 00:14:44 PST 2010
On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote:
> Jerome Glisse wrote:
> >On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
> >>>We had to do a similar thing in the
> >>>Poulsbo driver and it turned out that we could save a significant amount of
> >>>CPU by using a delayed workqueue, collecting objects and destroying them
> >>Yes, indeed, we don't really care about a fence expiring unless we
> >>want to use that buffer or we are out of memory.
> >>Processing each fence, especially if there is an interrupt per fence,
> >>seems unnecessary, since you can just do that work in the cases
> >>mentioned above.
> >>So, here is a new proposal that should have the best features of all
> >>previously considered methods.
> >>Assume first that there is a single FIFO channel.
> >>We introduce a list of fence nodes, each containing:
> >>- A list of buffers NOT to be destroyed having the fence as their sync
> >>object (or possibly one for each memory type)
> >>- A delayed-destroy list of buffers having the fence as their sync object
> >>Fence nodes are added at the end upon emission.
> >>They are removed and freed when the buffer count drops to 0 (meaning
> >>all buffers have been moved to later fences).
> >>Thus, the number of fence nodes does not grow without bounds, but is
> >>bounded by the number of buffers.
> >>The LRU lists are changed to only contain unfenced buffers and buffers
> >>are initially put on it.
> >>When a buffer is fenced, it is removed from the LRU list or its
> >>current fence list, and added to the new fence (possibly destroying
> >>the old fence node in the process).
> >>The reference count of buffer objects is only increased when they are
> >>put in a delayed destruction list.
> >>Upon deletion, they are destroyed immediately if they are on the LRU
> >>list and moved to the corresponding delayed-delete list if they are in
> >>the fenced list.
> >>ttm_bo_delayed_delete is no longer called by any workqueue, but only
> >>on when we run out of memory, before we try eviction.
> >>It processes the list until the first unsignalled fence, destroying
> >>all buffers it finds and freeing all the fence nodes.
> >>All the buffers in the alive lists are put in the LRU list, in the
> >>correct order.
> >>We may either keep an alive list per memory type, or sort the buffers
> >>by memory type (so they are put in the memory type LRU list) at this
> >>Thus, after ttm_bo_delayed_delete, there is the following scenario:
> >>1. Inactive buffers with no references are destroyed
> >>2. Inactive buffers with references are in the LRU list
> >>3. Fenced buffers with no references are in the delayed-destroy list
> >>attached to their fence
> >>4. Fenced buffers with references are in the alive list attached to their fence
> >>This should have about the same CPU and memory usage of the current
> >>code, but the following advantages:
> >>1. Delayed deletion does not run periodically, but exactly when needed
> >>(at allocation time, before eviction)
> >>2. Delayed deletion always deletes all buffers that can be deleted,
> >>since it processes them in fence order
> >>3. The LRU list at eviction time contains exactly all evictable buffers
> >>4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
> >>only one node is necessary
> >>5. Once buffer deletion determines a fence is signalled, it can
> >>destroyed all its to-be-destroyed buffers without any more checking
> >>6. Some fence logic (such as signalling of all fences on channel
> >>forced closing) can be moved from drivers to TTM
> >>7. Some channel logic can be moved from drivers to TTM
> >>8. The code should be simpler and more elegant
> >>Note that using a buffer with the CPU doesn't change its position in
> >>the LRU list.
> >>This is good, because evicting a buffer used by the CPU is actually a
> >>good thing, since it will move to a region of memory slower for the
> >>GPU but possibly faster for the CPU.
> >>However, for buffers in system memory, if the LRU list is used for
> >>swapping, CPU access should move the buffer to the front of list
> >>(using the LRU list for this is probably a bad idea though, since
> >>system memory swapping should be at page granularity).
> >>For multiple channels, things are slightly more complex.
> >>First, we need to built the fence data structure for each channel.
> >>Also, we need the fence/buffer nodes to be separate
> >>Delayed destruction continues to work, but the reference count needs
> >>to be set to the number of channels fencing the buffer on destruction.
> >>Putting buffers in the LRU list can be done by doing n-way merging
> >>between the channel fence lists, assigning each fence a global
> >>sequence number, and using that to merge and put back buffers in the
> >>n-way merging can be done with a small heap on the stack on current
> >>hardware where channels are limited.
> >>Furthermore, we need to keep a reference count, so that buffers are
> >>put in the LRU (or destroyed) only after they are off all the
> >>The LRU list semantics are relaxed. If a buffer has both its fence
> >>emitted before another buffer, and also signaled before it, then it
> >>will be considered as "less recently" used, and the opposite thing
> >>happens if both are after. Otherwise, the order is undetermined.
> >>This should still guarantee good eviction behavior, and allows to have
> >>the LRU list only contain evictable buffers, while not requiring to
> >>use a dynamic priority queue.
> >>Otherwise, a red-black tree (or an indirect heap with back references)
> >>indexed by the global fence sequence number can guarantee strict
> >>emission-time LRU semantics. This seems unnecessary though.
> >>Thoughts? Opinions?
> >>Any possible improvement?
> >I love the first solution but solution to multi fifo side
> >looks overkilling. So i am wondering if we aren't tackling
> >the issue from the wrong side. Below is slightly different
> >approach to the issue.
> >Idea is that memory management is about having bo in place
> >for the GPU. So instead of looking at the problem from the
> >bo side, look at it from the GPU address space side.
> >For simplicity here i am using only one space VRAM (but GTT
> >or others can be deal the sameway). So now we split the
> >problem to look at it as a transaction any time the driver
> >want to do somethings it ask to ttm to create new transaction
> >each transaction is associated to a fence and list of buffer
> >with their possible placement (i will explain how it can
> >fit multi fifo gpu well too and how it also allow easy support
> >to GPU with virtual address space which is somethings slowly
> >becoming a reality i think).
> >TTM receive a transaction it takes a transaction lock and give
> >transaction number to this new transaction. Then it goes over
> >the BO list and try to come up with the best placement for
> >each of them.
> Jerome, This seems to have a fundamental flaw of not allowing
> simultaneous command submission.
> With the current code, A process blocked trying to validate its
> buffers will not block other processes from validating, and we
> should keep it that way.
Idea is who comes first is first to be serve :) and it can deal
with multiple command submission if the hw can deal with it
(sync stuff i am briefly describing in multi fifo part). But
yes if process A is first in the kernel and process B could
have been executed right away, A will win and have right to
evict others, i don't think it will hurt perf, i think most
of the usecase we will have enough memory and better userspace
driver that the eviction from vram should become rare enough
to be noticeable.
More information about the Nouveau