[Intel-gfx] [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Nov 13 10:24:43 UTC 2018


On 09/11/2018 11:44, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
>>
>> On 08/11/2018 16:48, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
>>>>
>>>> On 08/11/2018 08:17, Chris Wilson wrote:
>>>>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
>>>>> in the shrinker while performing direct-reclaim. The trade-off being
>>>>> (much) lower latency for non-i915 clients at an increased risk of being
>>>>> unable to obtain a page from direct-reclaim without hitting the
>>>>> oom-notifier. The proviso being that we still keep trying to hard
>>>>> obtain the lock for oom so that we can reap under heavy memory pressure.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
>>>>>     1 file changed, 11 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>> index ea90d3a0d511..d461f458f4af 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
>>>>> @@ -36,7 +36,9 @@
>>>>>     #include "i915_drv.h"
>>>>>     #include "i915_trace.h"
>>>>>     
>>>>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>> +static bool shrinker_lock(struct drm_i915_private *i915,
>>>>> +                       unsigned int flags,
>>>>> +                       bool *unlock)
>>>>>     {
>>>>>         switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
>>>>>         case MUTEX_TRYLOCK_RECURSIVE:
>>>>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
>>>>>     
>>>>>         case MUTEX_TRYLOCK_FAILED:
>>>>>                 *unlock = false;
>>>>> -             preempt_disable();
>>>>> -             do {
>>>>> -                     cpu_relax();
>>>>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
>>>>> -                             *unlock = true;
>>>>> -                             break;
>>>>> -                     }
>>>>> -             } while (!need_resched());
>>>>> -             preempt_enable();
>>>>> +             if (flags & I915_SHRINK_ACTIVE) {
>>>>
>>>> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
>>>> in the normal case (direct reclaim?) or oom, we bail out on the first
>>>> sign of struct mutex contention. Doesn't this make our shrinker much
>>>> less effective at runtime and why is that OK?
>>>
>>> As I said, it's a tradeoff between blocking others for _several_
>>> _seconds_ and making no progress and returning immediately and making no
>>> progress. My argument is along the lines of if direct-reclaim is running
>>> in another process and something else is engaged in the driver hopefully
>>> the driver will be cleaning up as it goes along or else what remains is
>>> active and won't be reaped anyway. If direct reclaim is failing, the
>>> delay before trying the oom path is insignificant.
>>
>> What was the rationale behind busy looping there btw?
> 
> Emulating the optimistic spin for mutex (my patches to expose it from
> kernel/locking were kept hidden for public decency). My thinking was the
> exact opposite to this patch, that direct reclaim was of paramount
> importance and spending the time to try and ensure we grabbed the
> struct_mutex to search for some pages to free was preferable.
> 
> It's just on the basis of looking at the actual syslatency and realising
> the cause is this spinner, I want to swing the axe in other direction.
> 
> (There's probably a compromise, but honestly I'd prefer to sell the
> struct_mutex free version of the shrinker first :)
> 
>> Compared to
>> perhaps an alternative of micro-sleeps and trying a few times? I know it
>> would be opposite from what this patch is trying to achieve, I Just
>> don't had a good judgment on what makes most sense for the shrinker. Is
>> it better to perhaps try a little bit harder instead of giving up
>> immediately, but try a little bit harder in a softer way? Or that ends
>> up blocking the callers and has the same effect of making no progress?
> 
> Exactly. We can definitely measure the impact of the spinner on
> unrelated processes, but detecting the premature allocation failure is
> harder (we wait for more dmesg-warns). The compromise that I've tried to
> reach here is that if direct-reclaim isn't enough, then we should still
> try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
> vulnerable to not shrinking i915, but a worthwhile compromise as it's
> allowed to fail?)
> 
>>>> Or in other words, for what use cases, tests or benchmark was the
>>>> existing approach of busy looping a problem?
>>>
>>> Do something like 'find / -exec cat' while running i915 and see how long
>>> you have to wait for a khungtaskd :|
>>
>> I couldn't reproduce anything strange with this. Assuming you meant
>> something like -exec cat { } \; >dev/null.
>>
>> Either case I think explanations like this should go into the commit
>> message.
> 
> Weird, I spent so long over the last few weeks talking about the impact
> on gem_syslatency, I thought it was mentioned here.
> 
>>> gem_syslatency reports max latencies of over 600s, and I'm sure it's
>>> pretty much unbounded. It's bad enough that I expect we are the cause of
>>> significant hitching (mainly in other tasks) on the desktop running at
>>> memory capacity.
>>
>> Running gem_syslatency in parallel to the find, or gem_syslatency -b in
>> parallel or not did not do anything.
> 
> gem_syslatency -b -m
> 
> +------------------------------------------------------------------------+
> |      x                                                                 |
> |      x                                                                 |
> .
> .
> .
> |      x                                                                 |
> |      x xxx      xx x           x  x      x                            x|
> ||_____M_A_______|                                                       |
> +------------------------------------------------------------------------+
>      N           Min           Max        Median           Avg        Stddev
> x 120       2284928  6.752085e+08       3385097      20825362      80352645
> 
> Unpatched, and similar have been observed on ivb/byt/bdw/bsw/skl/kbl and
> while we were fighting ksoftirqd. 675s maximum latency for a RT thread to wake up.
> 
> Patched, max is around 261ms and order of magnitude better than the best
> previously!

Yes I have reproduced the ~20x better max latency myself. Mean latency 
stays roughly the same but has less variance. From that point of view 
this is very good.

> 
>> Then I tries gem_shrink but that doesn't seem to be making any progress
>> with or without the patch. But it's not even hitting the i915 hard, but
>> perhaps I need to turn off lockdep.. Which would be a bummer since I
>> wanted to have a lockdep to check the next bit..
> 
> gem_shrink never has completed... It's not quite the right test for this
> as we need to hit the shrinker & oom on both i915 and non-i915 paths to
> try and hit all the different lock chains.

Ok. :)

> 
>>>>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
>>>>> +                                       I915_MM_SHRINKER);
>>>>
>>>> _nested is needed since abandoning trylock would otherwise cause lockdep
>>>> issues? But is it really safe? I don't know.. how can it be? It is
>>>> guaranteed to be a different process here otherwise the result wouldn't
>>>> be MUTEX_TRYLOCK_FAILED.
>>>
>>> The easy option was to only force the mutex_lock for kswapd. However,
>>> noting that we do need to forcibly shrink before oom, I opted for the
>>> more inclusive attempt to take it in both. For the oom approach, my only
>>> handwaving is we shouldn't be nested under oom serialisation and so
>>> avoid an ABBA deadlock. It's survived some oom abuse, but the machine
>>> still becomes unresponsive (but pingable) eventually (as it does before
>>> the patch).
>>
>> ...which I completely did not understand! :(
>>
>> When you say kswapd - that means mutex->owner != current in the shrinker
>> and otherwise it is direct reclaim?
>>
>> If true, how is the nested annotation correct for the kswapd case?
> 
> We have 3 paths,
> direct-reclaim / oom from under i915 + struct_mutex
> direct-reclaim / oom from elsewher > kswapd

Is the kernel guaranteeing only one reclaim path at a time can happen? 
In other words if it is doing direct reclaim, kswapd won't enter the 
same shrinking notifier and vice versa?

> 
> The nested annotation doesn't mean that the lock is nested per-se, it
> just means that here it is acting as a different subclass of the lock.
> Since the kswapd path can't have struct_mutex, it is safe from being
> confused. The complication is if there is serialisation on
> direct-reclaim (which I think can be run concurrently, i.e. no
> mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
> guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
> under struct_mutex while fighting an oom in another process.

When you say direct reclaim can run concurrently, in that case we would 
have one thread potentially holding the struct_mutex and entering 
shrinker. Second thread would grab the mutex in the shrinker with the 
nested annotation. But it is not nested but overlapping. And you say 
nested is effectively a hack to enable different lock classes.

Actually in this case it is the same situation for any two simultaneous 
threads in direct reclaim, even if none holds struct mutex on entry. But 
in this case the lockdep annotation is also not needed.

Hm.. maybe to turn the question on it's head.. if you had a normal 
mutex_lock in there, what deadlock would lockdep notice?

Regards,

Tvrtko


More information about the Intel-gfx mailing list