[Intel-gfx] [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim

Thu Nov 15 11:39:31 UTC 2018

On 13/11/2018 17:10, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-11-13 10:24:43)
>>
>> On 09/11/2018 11:44, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
>>>>
>>>> On 08/11/2018 16:48, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
>>>>>>
>>>>>> On 08/11/2018 08:17, Chris Wilson wrote:
>>>>>>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
>>>>>>> in the shrinker while performing direct-reclaim. The trade-off being
>>>>>>> (much) lower latency for non-i915 clients at an increased risk of being
>>>>>>> unable to obtain a page from direct-reclaim without hitting the
>>>>>>> oom-notifier. The proviso being that we still keep trying to hard
>>>>>>> obtain the lock for oom so that we can reap under heavy memory pressure.
>>>>>>>
>>>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
>>>>>>>      1 file changed, 11 insertions(+), 13 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>>>> index ea90d3a0d511..d461f458f4af 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>>>> @@ -36,7 +36,9 @@
>>>>>>>      #include "i915_drv.h"
>>>>>>>      #include "i915_trace.h"
>>>>>>>      
>>>>>>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>>>> +static bool shrinker_lock(struct drm_i915_private *i915,
>>>>>>> +                       unsigned int flags,
>>>>>>> +                       bool *unlock)
>>>>>>>      {
>>>>>>>          switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
>>>>>>>          case MUTEX_TRYLOCK_RECURSIVE:
>>>>>>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>>>>      
>>>>>>>          case MUTEX_TRYLOCK_FAILED:
>>>>>>>                  *unlock = false;
>>>>>>> -             preempt_disable();
>>>>>>> -             do {
>>>>>>> -                     cpu_relax();
>>>>>>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
>>>>>>> -                             *unlock = true;
>>>>>>> -                             break;
>>>>>>> -                     }
>>>>>>> -             } while (!need_resched());
>>>>>>> -             preempt_enable();
>>>>>>> +             if (flags & I915_SHRINK_ACTIVE) {
>>>>>>
>>>>>> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
>>>>>> in the normal case (direct reclaim?) or oom, we bail out on the first
>>>>>> sign of struct mutex contention. Doesn't this make our shrinker much
>>>>>> less effective at runtime and why is that OK?
>>>>>
>>>>> As I said, it's a tradeoff between blocking others for _several_
>>>>> _seconds_ and making no progress and returning immediately and making no
>>>>> progress. My argument is along the lines of if direct-reclaim is running
>>>>> in another process and something else is engaged in the driver hopefully
>>>>> the driver will be cleaning up as it goes along or else what remains is
>>>>> active and won't be reaped anyway. If direct reclaim is failing, the
>>>>> delay before trying the oom path is insignificant.
>>>>
>>>> What was the rationale behind busy looping there btw?
>>>
>>> Emulating the optimistic spin for mutex (my patches to expose it from
>>> kernel/locking were kept hidden for public decency). My thinking was the
>>> exact opposite to this patch, that direct reclaim was of paramount
>>> importance and spending the time to try and ensure we grabbed the
>>> struct_mutex to search for some pages to free was preferable.
>>>
>>> It's just on the basis of looking at the actual syslatency and realising
>>> the cause is this spinner, I want to swing the axe in other direction.
>>>
>>> (There's probably a compromise, but honestly I'd prefer to sell the
>>> struct_mutex free version of the shrinker first :)
>>>
>>>> Compared to
>>>> perhaps an alternative of micro-sleeps and trying a few times? I know it
>>>> would be opposite from what this patch is trying to achieve, I Just
>>>> don't had a good judgment on what makes most sense for the shrinker. Is
>>>> it better to perhaps try a little bit harder instead of giving up
>>>> immediately, but try a little bit harder in a softer way? Or that ends
>>>> up blocking the callers and has the same effect of making no progress?
>>>
>>> Exactly. We can definitely measure the impact of the spinner on
>>> unrelated processes, but detecting the premature allocation failure is
>>> harder (we wait for more dmesg-warns). The compromise that I've tried to
>>> reach here is that if direct-reclaim isn't enough, then we should still
>>> try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
>>> vulnerable to not shrinking i915, but a worthwhile compromise as it's
>>> allowed to fail?)
>>>
>>>>>> Or in other words, for what use cases, tests or benchmark was the
>>>>>> existing approach of busy looping a problem?
>>>>>
>>>>> Do something like 'find / -exec cat' while running i915 and see how long
>>>>> you have to wait for a khungtaskd :|
>>>>
>>>> I couldn't reproduce anything strange with this. Assuming you meant
>>>> something like -exec cat { } \; >dev/null.
>>>>
>>>> Either case I think explanations like this should go into the commit
>>>> message.
>>>
>>> Weird, I spent so long over the last few weeks talking about the impact
>>> on gem_syslatency, I thought it was mentioned here.
>>>
>>>>> gem_syslatency reports max latencies of over 600s, and I'm sure it's
>>>>> pretty much unbounded. It's bad enough that I expect we are the cause of
>>>>> significant hitching (mainly in other tasks) on the desktop running at
>>>>> memory capacity.
>>>>
>>>> Running gem_syslatency in parallel to the find, or gem_syslatency -b in
>>>> parallel or not did not do anything.
>>>
>>> gem_syslatency -b -m
>>>
>>> +------------------------------------------------------------------------+
>>> |      x                                                                 |
>>> |      x                                                                 |
>>> .
>>> .
>>> .
>>> |      x                                                                 |
>>> |      x xxx      xx x           x  x      x                            x|
>>> ||_____M_A_______|                                                       |
>>> +------------------------------------------------------------------------+
>>>       N           Min           Max        Median           Avg        Stddev
>>> x 120       2284928  6.752085e+08       3385097      20825362      80352645
>>>
>>> Unpatched, and similar have been observed on ivb/byt/bdw/bsw/skl/kbl and
>>> while we were fighting ksoftirqd. 675s maximum latency for a RT thread to wake up.
>>>
>>> Patched, max is around 261ms and order of magnitude better than the best
>>> previously!
>>
>> Yes I have reproduced the ~20x better max latency myself. Mean latency
>> stays roughly the same but has less variance. From that point of view
>> this is very good.
>>
>>>
>>>> Then I tries gem_shrink but that doesn't seem to be making any progress
>>>> with or without the patch. But it's not even hitting the i915 hard, but
>>>> perhaps I need to turn off lockdep.. Which would be a bummer since I
>>>> wanted to have a lockdep to check the next bit..
>>>
>>> gem_shrink never has completed... It's not quite the right test for this
>>> as we need to hit the shrinker & oom on both i915 and non-i915 paths to
>>> try and hit all the different lock chains.
>>
>> Ok. :)
>>
>>>
>>>>>>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
>>>>>>> +                                       I915_MM_SHRINKER);
>>>>>>
>>>>>> _nested is needed since abandoning trylock would otherwise cause lockdep
>>>>>> issues? But is it really safe? I don't know.. how can it be? It is
>>>>>> guaranteed to be a different process here otherwise the result wouldn't
>>>>>> be MUTEX_TRYLOCK_FAILED.
>>>>>
>>>>> The easy option was to only force the mutex_lock for kswapd. However,
>>>>> noting that we do need to forcibly shrink before oom, I opted for the
>>>>> more inclusive attempt to take it in both. For the oom approach, my only
>>>>> handwaving is we shouldn't be nested under oom serialisation and so
>>>>> avoid an ABBA deadlock. It's survived some oom abuse, but the machine
>>>>> still becomes unresponsive (but pingable) eventually (as it does before
>>>>> the patch).
>>>>
>>>> ...which I completely did not understand! :(
>>>>
>>>> When you say kswapd - that means mutex->owner != current in the shrinker
>>>> and otherwise it is direct reclaim?
>>>>
>>>> If true, how is the nested annotation correct for the kswapd case?
>>>
>>> We have 3 paths,
>>> direct-reclaim / oom from under i915 + struct_mutex
>>> direct-reclaim / oom from elsewher > kswapd
>>
>> Is the kernel guaranteeing only one reclaim path at a time can happen?
>> In other words if it is doing direct reclaim, kswapd won't enter the
>> same shrinking notifier and vice versa?
> 
> direct-reclaim is parallelised; the shrink_slab is only guarded by a
> read lock (to prevent the shrinker lists being modified). Multiple
> callers into direct-reclaim is allowed, and the caller is not always
> guaranteed of obtaining the pages they themselves freed. Afaik, there's
> no lock there to cause contention and inversion for us.
> 
>>> The nested annotation doesn't mean that the lock is nested per-se, it
>>> just means that here it is acting as a different subclass of the lock.
>>> Since the kswapd path can't have struct_mutex, it is safe from being
>>> confused. The complication is if there is serialisation on
>>> direct-reclaim (which I think can be run concurrently, i.e. no
>>> mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
>>> guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
>>> under struct_mutex while fighting an oom in another process.
>>
>> When you say direct reclaim can run concurrently, in that case we would
>> have one thread potentially holding the struct_mutex and entering
>> shrinker. Second thread would grab the mutex in the shrinker with the
>> nested annotation. But it is not nested but overlapping. And you say
>> nested is effectively a hack to enable different lock classes.
>>
>> Actually in this case it is the same situation for any two simultaneous
>> threads in direct reclaim, even if none holds struct mutex on entry. But
>> in this case the lockdep annotation is also not needed.
>>
>> Hm.. maybe to turn the question on it's head.. if you had a normal
>> mutex_lock in there, what deadlock would lockdep notice?
> 
> The direct-reclaim paths are explicitly marked up so as to avoid taking
> a mutex inside the shrinker that may be held while allocating. That's
> the contention we avoid by special casing the recursive mutex; but it
> also means we can't use a plain mutex_lock() for the cases where we are
> happy that we won't deadlock. Hence why previously it was a trylock to
> avoid the same lockdep warning.

Should we now drop use of mutex_trylock_recursive to employ the owner 
trick directly from the shrinker? Like more or less:

shrinker_lock(..)
{
         if (unlikely(__mutex_owner(lock) == current)) {
		*unlock = false;
                 return true;
	}

	mutex_lock_nested(...)
	*unlock = true;

	return true;
}

Regards,

Tvrtko