Memory eviction in ttm

Thomas Hellstrom thellstrom at vmware.com
Mon Sep 17 14:35:28 PDT 2012


On 09/14/2012 03:36 PM, Maarten Lankhorst wrote:
>> On 14-09-12 12:50, Thomas Hellström wrote:
>> Hi Maarten!
>>
>> Broadening the audience a bit..
>>
>> On 9/14/12 9:12 AM, Maarten Lankhorst wrote:
>>>> On 13-09-12 23:00, Thomas Hellstrom wrote:
>>>> On 09/13/2012 07:13 PM, Maarten Lankhorst wrote:
>>>>> Hey
>>>>>
>>>>>> On 13-09-12 18:41, Thomas Hellstrom wrote:
>>>>>> On 09/13/2012 05:19 PM, Maarten Lankhorst wrote:
>>>>>>> Hey,
>>>>>>>
>>>>>>>> On 12-09-12 15:28, Thomas Hellstrom wrote:
>>>>>>>> On 09/12/2012 02:48 PM, Maarten Lankhorst wrote:
>>>>>>>>> Hey Thomas,
>>>>>>>>>
>>>>>>>>> I'm playing around with moving reservations from ttm to global, but how
>>>>>>>>> ttm is handling reservations is getting in the way. The code wants to remove
>>>>>>>>> the bo from the lru list at the same time a reservation is made, but that
>>>>>>>>> seems to be slightly too strict. It would really help me if that guarantee
>>>>>>>>> were removed.
>>>>>>>> Hi, Maarten.
>>>>>>>>
>>>>>>>> Removing that restriction is not really possible at the moment.
>>>>>>>> Also the memory accounting code depends on this, and may cause reservations
>>>>>>>> in the most awkward places. Since these reservations don't have a ticket
>>>>>>>> they may and will cause deadlocks. So in short the restriction is there
>>>>>>>> to avoid deadlocks caused by ticketless reservations.
>>>>>>> I have finished the lockdep annotations now which seems to catch almost
>>>>>>> all abuse I threw at it, so I'm feeling slightly more confident about moving
>>>>>>> the locking order and reservations around.
>>>>>> Maarten, moving reservations in TTM out of the lru lock is incorrect as the code is
>>>>>> written now. If we want to move it out, we need something for ticketless reservations.
>>>>>>
>>>>>> I've been thinking of having a global hash table of tickets with the task struct pointer as the key,
>>>>>> but even then, we'd need to be able to handle EBUSY errors on every operation that might try to
>>>>>> reserve a buffer.
>>>>>>
>>>>>> The fact that lockdep doesn't complain isn't enough. There *will* be deadlock use-cases when TTM is handed
>>>>>> the right data-set.
>>>>>>
>>>>>> Isn't there a way that a subsystem can register a callback to be performed to remove stuff from LRU and
>>>>>> to take a pre-reservation lock?
>>>>> What if multiple subsystems need those? You will end up with a deadlock again.
>>>>>
>>>>> I think it would be easier to change the code in ttm_bo.c to not assume the first
>>>>> item on the lru list is really the least recently used, and assume the first item
>>>>> that can be reserved without blocking IS the least recently used instead.
>>>> So what would happen then is that we'd spin on the first item on the LRU list, since
>>>> when reserving we must release the LRU lock, and if reserving fails, we thus
>>>> need to restart LRU traversal. Typically after a schedule(). That's bad.
>>>>
>>>> So let's take a step back and analyze why the LRU lock has become a problem.
>>>> From what I can tell, it's because you want to use per-object locks when reserving instead of a
>>>> global reservation lock (that TTM could use as the LRU lock). Is that correct?
>>>> And in that case, in what situation do you envision such a global lock being contended
>>>> to the extent that it hurts performance?
>>>>
>>>>>>> Lockdep WILL complain about trying to use multiple tickets, doing ticketed
>>>>>>> and unticketed blocking reservations mixed, etc.
>>>>>>>
>>>>>>> I want to remove the global fence_lock and make it a per buffer lock, with some
>>>>>>> lockdep annotations it's perfectly legal to grab obj->fence_lock and obj2->fence_lock
>>>>>>> if you have a reservation, but it should complain loudly about trying to take 2 fence_locks
>>>>>>> at the same time without a reservation.
>>>>>> Yes, TTM was previously using per buffer fence locks, and that works fine from a deadlock perspective, but
>>>>>> it hurts performance. Fencing 200 buffers in a command submission (google-earth for example) will mean
>>>>>> 198 unnecessary locks, each discarding the processor pipelines. Locking is a *slow* operation, particularly
>>>>>> on systems with many processors, and I don't think it's a good idea to change that back, without analyzing
>>>>>> the performance impact. There are reasons people are writing stuff like RCU to avoid locking...
>>>>> So why don't we simply use RCU for fence pointers and get rid of the fence locking? :D
>>>>> danvet originally suggested it as a joke but if you think about it, it would make a lot of sense for this usecase.
>>>> I thought of that before, but the problem is you'd still need a spinlock to change the buffer's fence pointer,
>>>> even if reading it becomes quick.
>>> Actually, I changed the lockdep annotations a bit to distinguish between the
>>> cases where ttm_bo_wait is called without a reservation and where it is
>>> called with one. As far as I can see there are only 2 places that do it without,
>>> at least if I converted my git tree properly:
>>>
>>> http://cgit.freedesktop.org/~mlankhorst/linux/log/?h=v10-wip
>>>
>>> First one is nouveau_bo_vma_del, this can be fixed easily.
>>> Second one is ttm_bo_cleanup_refs and ttm_bo_cleanup_refs_or_queue;
>>> if reservation is done first, before ttm_bo_wait, the fence_lock could be
>>> dropped entirely by adding smp_mb() in reserve and unreserve; functionally
>>> there would be no difference. So if you can verify my lockdep annotations are
>>> correct in the most recent commit wrt what's using ttm_bo_wait without reservation,
>>> we could remove the fence_lock entirely.
>>>
>>> ~Maarten
>> Being able to wait for buffer idle or get the fence pointer without reserving is a fundamental property of TTM. Reservation is a long-term lock; the fence lock is a very short-term lock. If I were to choose, I'd rather accept per-object fence locks than remove this property, but see below.
> I was trying to say the ONLY place where this is used in ttm is in ttm_bo_cleanup_refs(or *_or_queue).
> No driver depends on this, they all take a reservation first. As such fence_lock can be removed if you
> change the order of the reservation and waiting around, making it behave the same as the rest of ttm.
> http://cgit.freedesktop.org/~mlankhorst/linux/commit/?h=v10-wip&id=b879ac51810408809d6424d26da4013171d1bc15
> fixes up those 2 calls, then removes the fence_lock altogether.

In the nouveau case the function nouveau_bo_vma_del seems to be called
from different places, both with and without a reservation held, and
while it might look simple to add a reserve around a wait for idle,
it's a design decision not to require it, because in some situations
it's just not a good thing to do.

>> Likewise, to be able to guarantee that a reserved object is not on any LRU list is also an important property. Removing that property will, in addition to the spin wait we've already discussed, make understanding TTM locking even more difficult, and I'd really like to avoid that.
> This guarantee will no longer be true if reservations for buffer objects are going to be shared
> across devices, other users of the reservations may have their own requirements.
>
> Why is this property important though? If it could be changed to 'likely unreserved'
> instead of guaranteed not on the lru list the few places that rely on this could be
> fixed to attempt a nonblocking reserve first.

Please see above what I wrote about spinning on the first item and
restarting the list traversal.
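To make the spin concrete, here is a minimal userspace sketch of the
trylock-based LRU walk under discussion. Pthread mutexes stand in for
the kernel locks, and all names (evict_first_reservable and friends)
are invented for illustration, not the real TTM API. A busy head gets
skipped, so eviction order is only approximately LRU, and when every
entry is busy the caller must drop its locks, schedule() and restart
the walk from the head.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct bo {
	pthread_mutex_t reserve;	/* stand-in for the bo reservation */
	struct bo *next;		/* LRU link; head = least recently used */
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static struct bo *lru_head;

/*
 * Walk the LRU and take the first buffer we can reserve without
 * blocking.  A busy head is skipped, so eviction order is only
 * approximately LRU; if *every* entry is busy we return NULL and the
 * caller has to schedule() and restart the walk -- the spin described
 * above.
 */
static struct bo *evict_first_reservable(void)
{
	struct bo **prev, *bo;

	pthread_mutex_lock(&lru_lock);
	for (prev = &lru_head; (bo = *prev) != NULL; prev = &bo->next) {
		if (pthread_mutex_trylock(&bo->reserve) == 0) {
			*prev = bo->next;	/* off the LRU, now reserved */
			pthread_mutex_unlock(&lru_lock);
			return bo;
		}
	}
	pthread_mutex_unlock(&lru_lock);
	return NULL;		/* everything busy: caller must retry */
}
```

With two buffers on the list and the head's reservation held elsewhere,
the walk hands back the second buffer; once only busy buffers remain it
returns NULL and the caller is stuck looping.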

>
> Is it just ttm_mem_evict_first / ttm_bo_swapout / ttm_bo_force_list_clean
> that would break down? If so those could do a simple iteration over the list
> to find the first unreserved buffer instead, or do the default thing those functions
> do when the list is empty, if they traversed the full list without finding anything
> that could be reserved.

TTM guarantees, if desired, each user-space client full access to *all*
GPU memory resources, so that a user-space client reliably knows when
to flush. If it fails buffer validation in the first execbuf pass, it
may take an exclusive execbuf lock, evict *all* buffers and retry.
Then *all* buffers must be evicted; we can't leave any buffers around
that are reserved but not taken off LRU lists.
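As a toy sketch of why that matters (userspace, pthread mutexes in
place of the kernel locks, invented names; the real code would drop the
lru lock around the actual eviction work), the evict-everything pass
below only terminates because anything still on the LRU list is, by the
invariant, unreserved:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct bo {
	pthread_mutex_t reserve;	/* stand-in for the bo reservation */
	struct bo *next;		/* LRU link */
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;
static struct bo *lru_head;

/*
 * Evict every buffer.  Because reserving atomically removes a buffer
 * from the LRU list, everything still on the list is unreserved and
 * the trylock below cannot fail, so the loop always empties the list.
 * Without the invariant we could meet a reserved buffer here and would
 * have to give up or spin.
 */
static int evict_all(void)
{
	struct bo *bo;
	int evicted = 0;

	pthread_mutex_lock(&lru_lock);
	while ((bo = lru_head) != NULL) {
		int busy = pthread_mutex_trylock(&bo->reserve);

		assert(!busy);		/* guaranteed by the invariant */
		lru_head = bo->next;	/* off the list, now reserved */
		/* ... evict backing storage here, then unreserve ... */
		pthread_mutex_unlock(&bo->reserve);
		evicted++;
	}
	pthread_mutex_unlock(&lru_lock);
	return evicted;
}
```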

Also, at final buffer destruction you must be able to guarantee that
you have the buffer reserved and that you are the only one holding a
reference. That's pretty tricky to do if you don't take the buffer off
LRU lists atomically when reserving.
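A sketch of the invariant being defended, again with pthreads standing
in for the kernel primitives and with invented names (reserve_and_remove,
can_destroy): reservation and LRU removal happen inside one lru_lock
critical section, so "reserved" always implies "on no LRU list", and
final destruction can then trust a simple refcount check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct bo {
	pthread_mutex_t reserve;	/* stand-in for the bo reservation */
	bool on_lru;			/* LRU membership, simplified to a flag */
	int refcount;
};

static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Reserve and pull the buffer off the LRU inside one lru_lock critical
 * section.  Afterwards "reserved" implies "off the LRU lists": no
 * eviction walker can be looking at this buffer concurrently.
 */
static bool reserve_and_remove(struct bo *bo)
{
	bool ok;

	pthread_mutex_lock(&lru_lock);
	ok = pthread_mutex_trylock(&bo->reserve) == 0;
	if (ok)
		bo->on_lru = false;
	pthread_mutex_unlock(&lru_lock);
	return ok;
}

/*
 * Final destruction: with the buffer reserved and off all LRU lists,
 * a refcount of 1 really does mean we are the only user left.
 */
static bool can_destroy(struct bo *bo)
{
	return !bo->on_lru && bo->refcount == 1;
}
```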

>
>> If this were a real performance problem we were trying to solve it would be easier to motivate changes in this area, but if it's just trying to avoid a global reservation lock and a global fence lock that will rarely if ever see any contention, I can't see the point. On the contrary, having per-object locks will be very costly when reserving / fencing many objects. As mentioned before, in the fence lock case it's been tried and removed, so I'd like to know the reasoning behind introducing it again, and in what situations you think the global locks will be contended.
> Because I'm trying to keep things as simple as possible, and this whole reservation business is already giving me a headache,
> I would really like to avoid any special lock that interacts with reservations in a complex way. The easiest way would be
> to instead accept that the reservation and your lru removal call may not be atomic. I hope to eventually support eviction though,
> so some notification that the buffer was reserved successfully is likely going to be needed; I just don't think it needs to be done atomically.
>

Then may I humbly suggest that you put a global spinlock around
reserving, and a list of callbacks to be called, while still locked,
after reserving. Any subsystem wanting to pull stuff off LRU lists can
then register a callback. That's no complex interaction.
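A rough sketch of that suggestion, with hypothetical names throughout
and userspace pthreads instead of a kernel spinlock: a global lock is
taken around reserving, and every registered callback runs before the
lock is dropped, so each subsystem can pull the object off its own LRU
lists atomically with the reservation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct bo;			/* buffer object; details don't matter here */

typedef void (*reserve_cb_t)(struct bo *bo);

#define MAX_RESERVE_CALLBACKS 8

static pthread_mutex_t reserve_lock = PTHREAD_MUTEX_INITIALIZER;
static reserve_cb_t reserve_callbacks[MAX_RESERVE_CALLBACKS];
static int num_reserve_callbacks;

/* A subsystem keeping buffers on its own LRU lists registers here. */
static void register_reserve_callback(reserve_cb_t cb)
{
	pthread_mutex_lock(&reserve_lock);
	reserve_callbacks[num_reserve_callbacks++] = cb;
	pthread_mutex_unlock(&reserve_lock);
}

/*
 * Reserve: take the global lock, mark the object reserved, then run
 * every registered callback *before* dropping the lock, so LRU removal
 * stays atomic with the reservation from each subsystem's view.
 */
static void reserve(struct bo *bo)
{
	int i;

	pthread_mutex_lock(&reserve_lock);
	/* ... mark bo reserved here ... */
	for (i = 0; i < num_reserve_callbacks; i++)
		reserve_callbacks[i](bo);
	pthread_mutex_unlock(&reserve_lock);
}

/* Example subsystem callback: just counts removals for the demo. */
static int lru_removals;
static void sample_lru_cb(struct bo *bo)
{
	(void)bo;		/* would do the real list_del here */
	lru_removals++;
}
```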

/Thomas

More information about the dri-devel mailing list