[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

Mon Jan 25 00:14:44 PST 2010

On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote:
> Jerome Glisse wrote:
> >On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
> >>>We had to do a similar thing in the
> >>>Poulsbo driver and it turned out that we could save a significant amount of
> >>>CPU by using a delayed workqueue, collecting objects and destroying them
> >>>periodically.
> >>Yes, indeed, we don't really care about a fence expiring unless we
> >>want to use that buffer or we are out of memory.
> >>Processing each fence, especially if there is an interrupt per fence,
> >>seems unnecessary, since you can just do that work in the cases
> >>mentioned above.
> >>
> >>So, here is a new proposal that should have the best features of all
> >>previously considered methods.
> >>Assume first that there is a single FIFO channel.
> >>
> >>We introduce a list of fence nodes, each containing:
> >>- A list of buffers NOT to be destroyed having the fence as their sync
> >>object (or possibly one for each memory type)
> >>- A delayed-destroy list of buffers having the fence as their sync object
> >>
> >>Fence nodes are added at the end upon emission.
> >>They are removed and freed when the buffer count drops to 0 (meaning
> >>all buffers have been moved to later fences).
> >>Thus, the number of fence nodes does not grow without bounds, but is
> >>bounded by the number of buffers.
> >>.
> >>The LRU lists are changed to only contain unfenced buffers and buffers
> >>are initially put on it.
> >>When a buffer is fenced, it is removed from the LRU list or its
> >>current fence list, and added to the new fence (possibly destroying
> >>the old fence node in the process).
> >>The reference count of buffer objects is only increased when they are
> >>put in a delayed destruction list.
> >>
> >>Upon deletion, they are destroyed immediately if they are on the LRU
> >>list and moved to the corresponding delayed-delete list if they are in
> >>the fenced list.
> >>
> >>ttm_bo_delayed_delete is no longer called by any workqueue, but only
> >>on when we run out of memory, before we try eviction.
> >>It processes the list until the first unsignalled fence, destroying
> >>all buffers it finds and freeing all the fence nodes.
> >>All the buffers in the alive lists are put in the LRU list, in the
> >>correct order.
> >>We may either keep an alive list per memory type, or sort the buffers
> >>by memory type (so they are put in the memory type LRU list) at this
> >>point
> >>
> >>Thus, after ttm_bo_delayed_delete, there is the following scenario:
> >>1. Inactive buffers with no references are destroyed
> >>2. Inactive buffers with references are in the LRU list
> >>3. Fenced buffers with no references are in the delayed-destroy list
> >>attached to their fence
> >>4. Fenced buffers with references are in the alive list attached to their fence
> >>
> >>This should have about the same CPU and memory usage of the current
> >>code, but the following advantages:
> >>1. Delayed deletion does not run periodically, but exactly when needed
> >>(at allocation time, before eviction)
> >>2. Delayed deletion always deletes all buffers that can be deleted,
> >>since it processes them in fence order
> >>3. The LRU list at eviction time contains exactly all evictable buffers
> >>4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
> >>only one node is necessary
> >>5. Once buffer deletion determines a fence is signalled, it can
> >>destroyed all its to-be-destroyed buffers without any more checking
> >>6. Some fence logic (such as signalling of all fences on channel
> >>forced closing) can be moved  from drivers to TTM
> >>7. Some channel logic can be moved from drivers to TTM
> >>8. The code should be simpler and more elegant
> >>
> >>Note that using a buffer with the CPU doesn't change its position in
> >>the LRU list.
> >>This is good, because evicting a buffer used by the CPU is actually a
> >>good thing, since it will move to a region of memory slower for the
> >>GPU but possibly faster for the CPU.
> >>However, for buffers in system memory, if the LRU list is used for
> >>swapping, CPU access should move the buffer to the front of list
> >>(using the LRU list for this is probably a bad idea though, since
> >>system memory swapping should be at page granularity).
> >>
> >>For multiple channels, things are slightly more complex.
> >>First, we need to built the fence data structure for each channel.
> >>Also, we need the fence/buffer nodes to be separate
> >>Delayed destruction continues to work, but the reference count needs
> >>to be set to the number of channels fencing the buffer on destruction.
> >>Putting buffers in the LRU list can be done by doing n-way merging
> >>between the channel fence lists, assigning each fence a global
> >>sequence number, and using that to merge and put back buffers in the
> >>LRU.
> >>n-way merging can be done with a small heap on the stack on current
> >>hardware where channels are limited.
> >>Furthermore, we need to keep a reference count, so that buffers are
> >>put in the LRU (or destroyed) only after they are off all the
> >>channels.
> >>
> >>The LRU list semantics are relaxed. If a buffer has both its fence
> >>emitted before another buffer, and also signaled before it, then it
> >>will be considered as "less recently" used, and the opposite thing
> >>happens if both are after. Otherwise, the order is undetermined.
> >>
> >>This should still guarantee good eviction behavior, and allows to have
> >>the LRU list only contain evictable buffers, while not requiring to
> >>use a dynamic priority queue.
> >>Otherwise, a red-black tree (or an indirect heap with back references)
> >>indexed by the global fence sequence number can guarantee strict
> >>emission-time LRU semantics. This seems unnecessary though.
> >>
> >>Thoughts? Opinions?
> >>Any possible improvement?
> >>
> >
> >I love the first solution but solution to multi fifo side
> >looks overkilling. So i am wondering if we aren't tackling
> >the issue from the wrong side. Below is slightly different
> >approach to the issue.
> >
> >Idea is that memory management is about having bo in place
> >for the GPU. So instead of looking at the problem from the
> >bo side, look at it from the GPU address space side.
> >
> >For simplicity here i am using only one space VRAM (but GTT
> >or others can be deal the sameway). So now we split the
> >problem to look at it as a transaction any time the driver
> >want to do somethings it ask to ttm to create new transaction
> >each transaction is associated to a fence and list of buffer
> >with their possible placement (i will explain how it can
> >fit multi fifo gpu well too and how it also allow easy support
> >to GPU with virtual address space which is somethings slowly
> >becoming a reality i think).
> >
> >TTM receive a transaction it takes a transaction lock and give
> >transaction number to this new transaction. Then it goes over
> >the BO list and try to come up with the best placement for
> >each of them.
> Jerome, This seems to have a fundamental flaw of not allowing
> simultaneous command submission.
> 
> With the current code, A process blocked trying to validate its
> buffers will not block other processes from validating, and we
> should keep it that way.
> 
> /Thomas
> 

Idea is who comes first is first to be serve :) and it can deal
with multiple command submission if the hw can deal with it
(sync stuff i am briefly describing in multi fifo part). But
yes if process A is first in the kernel and process B could
have been executed right away, A will win and have right to
evict others, i don't think it will hurt perf, i think most
of the usecase we will have enough memory and better userspace
driver that the eviction from vram should become rare enough
to be noticeable.

Cheers,
Jerome