buggy/weird behavior in ttm

Mon Oct 15 08:37:09 PDT 2012

Op 15-10-12 14:27, Thomas Hellstrom schreef:
> On 10/12/2012 12:09 PM, Maarten Lankhorst wrote:
>> Op 12-10-12 07:57, Thomas Hellstrom schreef:
>>> On 10/11/2012 10:55 PM, Maarten Lankhorst wrote:
>>>> Op 11-10-12 21:26, Thomas Hellstrom schreef:
>>>>> On 10/11/2012 08:42 PM, Maarten Lankhorst wrote:
>>>>>
>>>>>>> Anyway, if you plan to remove the fence lock and protect it with reserve, you must
>>>>>>> make sure that a waiting reserve is never done in a destruction path. I think this
>>>>>>> mostly concerns the nvidia driver.
>>>>>> Well I don't think any lock should ever be held during destruction time,
>>>>> What I mean is, that *something* needs to protect the fence pointer. Currently it's the
>>>>> fence lock, and I was assuming you'd protect it with reserve. And neither TTM nor
>>>>> Nvidia should, when a resource is about to be freed, be forced to *block* waiting for
>>>>> reserve just to access the fence pointer. When and if you have a solution that fulfills
>>>>> those requirements, I'm ready to review it.
>>>> It's not blocking, cleanup_refs_or_queue will toss it on the deferred list if reservation fails,
>>>> behavior doesn't change just because I changed the order around.
>>> Well, I haven't looked into the code in detail yet. If you say it's non-blocking I believe you.
>>> I was actually more concerned abut the Nvidia case where IIRC the wait was called both
>>> with and without reservation.
>>>
>>>
>>>>>>>> - no_wait_reserve is ignored if no_wait_gpu is false
>>>>>>>>       ttm_bo_reserve_locked can only return true if no_wait_reserve is true, but
>>>>>>>>       subsequently it will do a wait_unreserved if no_wait_gpu is false.
>>>>>>>> I'm planning on removing this argument and act like it is always true, since
>>>>>>>> nothing on the lru list should fail to reserve currently.
>>>>>>> Yes, since all buffers that are reserved are removed from the LRU list, there
>>>>>>> should never be a waiting reserve on them, so no_wait_reserve can be removed
>>>>>>> from ttm_mem_evict_first, ttm_bo_evict and possibly other functions in the call chain.
>>>>>> I suppose there will stay a small race though,
>>>>> Hmm, where?
>>>> When you enter the ddestroy path, you drop the lock and hope the buffer doesn't reserved
>>>> away from under you.
>>> Yes, that code isn't fully correct, it's missing a check for still on ddestroy after a waiting
>>> reserve. However, the only chance of a waiting reserve given that the buffer *IS* on the
>>> ddestroy list is if the current reserver returned early because someone started an
>>> accelerated eviction which can't happen currently. The code needs fixing up though.
>>>
>>>>>>>> - effectively unlimited callchain between some functions that all go through
>>>>>>>>       ttm_mem_evict_first:
>>>>>>>>
>>>>>>>>                                         /------------------------\
>>>>>>>> ttm_mem_evict_first - ttm_bo_evict -                          -ttm_bo_mem_space  - ttm_bo_mem_force_space - ttm_mem_evict_first
>>>>>>>>                                         \ ttm_bo_handle_move_mem /
>>>>>>>> I'm not surprised that there was a deadlock before, it seems to me it would
>>>>>>>> be pretty suicidal to ever do a blocking reserve on any of those lists,
>>>>>>>> lockdep would be all over you for this.
>>>>>>> Well, at first this may look worse than it actually is. The driver's eviction memory order determines the recursion depth
>>>>>>> and typically it's 0 or 1, since subsequent ttm_mem_evict_first should never touch the same LRU lists as the first one.
>>>>>>> What would typically happen is that a BO is evicted from VRAM to TT, and if there is no space in TT, another BO is evicted
>>>>>>> to system memory, and the chain is terminated. However a driver could set up any eviction order but that would be
>>>>>>> a BUG.
>>>>>>>
>>>>>>> But in essence, as you say, even with a small recursion depth, a waiting reserve could cause a deadlock.
>>>>>>> But there should be no waiting reserves in the eviction path currently.
>>>>>> Partially true, ttm_bo_cleanup_refs is currently capable of blocking reserve.
>>>>>> Fixing this might mean that ttm_mem_evict_first may need to become more aggressive,
>>>>>> since it seems all the callers of this function assume that ttm_mem_evict_first can only fail
>>>>>> if there is really nothing more to free and blocking nested would really upset lockdep
>>>>>> and leave you open to the same deadlocks.
>>>>> I can't see how the waiting reserve in ttm_bo_cleanup_refs would cause a deadlock,
>>>>> because the buffer about to be reserved is always *last* in a reservation sequence, and the
>>>>> reservation is always released (or the buffer destroyed) before trying to reserve another buffer.
>>>>> Technically the buffer is not looked up from a LRU list but from the delayed delete list.
>>>>> Could you describe such a deadlock case?
>>>> The only interesting case for this is ttm_mem_evict_first, and while it may not technically
>>>> be a deadlock, lockdep will flag you for blocking on this anyway, since the only reason it
>>>> would not be a deadlock is if you know the exact semantics of why.
>>> Interesting. I guess that must be because of the previous reservation history for that buffer?
>>> Let's say we were to reinitialize the lockdep history for the reservation object when it was put
>>> on the ddestroy list, I assume lockdep would keep quiet, because there are never any other
>>> bo reservations while such a buffer is reserved?
>> Lockdep works on classes of lock, not necessarily individual locks.
>>
>> Doing 2 bo_reserve's of any bo would count as possible deadlock, no matter if you
>> always take them in a certain order or not.
>
> So you mean that if I bo_reserve A and then bo_reserve B (which is used only when binding A to the GPU), lockdep will complain even if nobody ever bo_reserves B before A? That will make it impossible to use BOs as page tables for GPU binding for example.

As far as I tell can nobody does it like that, page tables are simply initialized on channel
creation, pinned in memory and kept like that while the host serializes with their own vm locking.

>> To make multi-object reservation work, the fix is to add a ticket "lock" of which all the
>> reservation objects are a nested lock of. Since in this case the ticket lock would prevent
>> deadlocks, this is acceptable.  Having 2 ticket 'locks' at the same time would count as
>> deadlock, rightfully. If you hold a reservation from a ticket, then try to reserve without
>> a ticket, it counts as deadlock too. See below for some examples I was using to test.
>
> But if a ticket lock can be used to abstract a locking sequence that we know will never deadlock,
> why can't we abstract locking from a list with atomic list removal with another "lock", and make sure we get the order right between them?

No, see the test below, lockdep won't be fooled by your diversions that easily!! :-)

However, if the following rules are valid, life becomes a lot easier:
  1. items on the ddestroy list are not allowed to have a new sync object attached, only
     reservations for destruction of this object are allowed.
  2. the only valid thing at this point left is unmapping from vm space and destruction
     of buffer backing in memory or vram
  3. All reservations outside of ttm_bo_cleanup_refs(or_queue) are forbidden.
     (XXX: check if an exception needed for drivers to accomplish this destruction?)

Adding a WARN_ON(!list_empty(&bo->ddestroy)); to ttm_bo_reserve and ttm_eu_reserve_buffers
should be enough to catch violators of the above rules.

If those rules hold, destruction becomes a lot more straightforward.
ttm_bo_swapout and ttm_mem_evict_first can both do the reserve with lru_lock held,
which will be guaranteed to succeed, and the buffer is also guaranteed to be
on the ddestroy list still since we haven't dropped the lock yet.

So with a reservation, lru_lock held, and entry on the ddestroy list still alive:

Do a trywait on the bo, if it returns -EBUSY, and no_wait_gpu is unset,
get a reference to bo->sync_obj. unreserve.

If -EBUSY was returned and no_wait_gpu is set, unlock lru_lock and return error.

If -EBUSY and no_wait_gpu was not set, unlock lru_lock, do a wait on our copy of
bo->sync_obj, and unref it, return if wait fails. If wait hasn't failed, retake
lru_lock.

Check if the if the ddestroy entry is gone. If so, someone else was faster,
return 0. If not simply remove bo from all lists without reservation,
unref bo->sync_obj**, and perform the remaining cleanup.

As far as I can see, this wouldn't leave open a race even if 2 threads do the same,
although the second thread might return 0 to signal success before the backing is
gone, but this can also happen right now. It's even harmless, since in those cases
the functions calling these functions would simply call them one more time, and this
time the destroyed buffer would definitely not be on the swap/lru lists any more.

It would also keep current behavior the same as far as I can tell, where multiple
waiters could wait for the same bo to be destroyed.

**) Yes strictly speaking a violation of fence rules, but in this case the buffer
already dropped to 0 references, and there's a BUG_ON in ttm_bo_release_list that
would get triggered otherwise. Can alternatively be fixed by doing a BUG_ON in
ttm_bo_release_list if there is a sync_obj that's not in the signaled state.

The test I mentioned:

>> static void reservation_test_block_ticket(void)
>> {
>>     struct reservation_ticket t;
>>     struct reservation_object o, o2;
>>
>>     reservation_object_init(&o);
>>     reservation_object_init(&o2);
>>
>>     object_reserve(&o, false, false, NULL);
>>     reservation_ticket_init(&t);
>>
>>     object_reserve(&o2, false, false, &t);
>>     object_unreserve(&o2, &t);
>>     object_unreserve(&o, NULL);
>>
>>     reservation_ticket_fini(&t);
>> }
(snip)
>>     dotest(reservation_test_block_ticket, FAILURE, LOCKTYPE_RESERVATION);