[Intel-gfx] [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim

Fri Nov 9 11:44:22 UTC 2018

Quoting Tvrtko Ursulin (2018-11-09 07:30:34)
> 
> On 08/11/2018 16:48, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
> >>
> >> On 08/11/2018 08:17, Chris Wilson wrote:
> >>> Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
> >>> in the shrinker while performing direct-reclaim. The trade-off being
> >>> (much) lower latency for non-i915 clients at an increased risk of being
> >>> unable to obtain a page from direct-reclaim without hitting the
> >>> oom-notifier. The proviso being that we still keep trying to hard
> >>> obtain the lock for oom so that we can reap under heavy memory pressure.
> >>>
> >>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
> >>>    1 file changed, 11 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>> index ea90d3a0d511..d461f458f4af 100644
> >>> --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>> +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> >>> @@ -36,7 +36,9 @@
> >>>    #include "i915_drv.h"
> >>>    #include "i915_trace.h"
> >>>    
> >>> -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>> +static bool shrinker_lock(struct drm_i915_private *i915,
> >>> +                       unsigned int flags,
> >>> +                       bool *unlock)
> >>>    {
> >>>        switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
> >>>        case MUTEX_TRYLOCK_RECURSIVE:
> >>> @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >>>    
> >>>        case MUTEX_TRYLOCK_FAILED:
> >>>                *unlock = false;
> >>> -             preempt_disable();
> >>> -             do {
> >>> -                     cpu_relax();
> >>> -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
> >>> -                             *unlock = true;
> >>> -                             break;
> >>> -                     }
> >>> -             } while (!need_resched());
> >>> -             preempt_enable();
> >>> +             if (flags & I915_SHRINK_ACTIVE) {
> >>
> >> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink
> >> in the normal case (direct reclaim?) or oom, we bail out on the first
> >> sign of struct mutex contention. Doesn't this make our shrinker much
> >> less effective at runtime and why is that OK?
> > 
> > As I said, it's a tradeoff between blocking others for _several_
> > _seconds_ and making no progress and returning immediately and making no
> > progress. My argument is along the lines of if direct-reclaim is running
> > in another process and something else is engaged in the driver hopefully
> > the driver will be cleaning up as it goes along or else what remains is
> > active and won't be reaped anyway. If direct reclaim is failing, the
> > delay before trying the oom path is insignificant.
> 
> What was the rationale behind busy looping there btw?

Emulating the optimistic spin for mutex (my patches to expose it from
kernel/locking were kept hidden for public decency). My thinking was the
exact opposite to this patch, that direct reclaim was of paramount
importance and spending the time to try and ensure we grabbed the
struct_mutex to search for some pages to free was preferable.

It's just on the basis of looking at the actual syslatency and realising
the cause is this spinner, I want to swing the axe in other direction.

(There's probably a compromise, but honestly I'd prefer to sell the
struct_mutex free version of the shrinker first :)

> Compared to 
> perhaps an alternative of micro-sleeps and trying a few times? I know it 
> would be opposite from what this patch is trying to achieve, I Just 
> don't had a good judgment on what makes most sense for the shrinker. Is 
> it better to perhaps try a little bit harder instead of giving up 
> immediately, but try a little bit harder in a softer way? Or that ends 
> up blocking the callers and has the same effect of making no progress?

Exactly. We can definitely measure the impact of the spinner on
unrelated processes, but detecting the premature allocation failure is
harder (we wait for more dmesg-warns). The compromise that I've tried to
reach here is that if direct-reclaim isn't enough, then we should still
try hard to grab the struct_mutex. (That leaves __GFP_RETRY_MAYFAIL
vulnerable to not shrinking i915, but a worthwhile compromise as it's
allowed to fail?)

> >> Or in other words, for what use cases, tests or benchmark was the
> >> existing approach of busy looping a problem?
> > 
> > Do something like 'find / -exec cat' while running i915 and see how long
> > you have to wait for a khungtaskd :|
> 
> I couldn't reproduce anything strange with this. Assuming you meant 
> something like -exec cat { } \; >dev/null.
> 
> Either case I think explanations like this should go into the commit 
> message.

Weird, I spent so long over the last few weeks talking about the impact
on gem_syslatency, I thought it was mentioned here.

> > gem_syslatency reports max latencies of over 600s, and I'm sure it's
> > pretty much unbounded. It's bad enough that I expect we are the cause of
> > significant hitching (mainly in other tasks) on the desktop running at
> > memory capacity.
> 
> Running gem_syslatency in parallel to the find, or gem_syslatency -b in 
> parallel or not did not do anything.

gem_syslatency -b -m

+------------------------------------------------------------------------+
|      x                                                                 |
|      x                                                                 |
.
.
.
|      x                                                                 |
|      x xxx      xx x           x  x      x                            x|
||_____M_A_______|                                                       |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 120       2284928  6.752085e+08       3385097      20825362      80352645

Unpatched, and similar have been observed on ivb/byt/bdw/bsw/skl/kbl and
while we were fighting ksoftirqd. 675s maximum latency for a RT thread to wake up.

Patched, max is around 261ms and order of magnitude better than the best
previously!

> Then I tries gem_shrink but that doesn't seem to be making any progress 
> with or without the patch. But it's not even hitting the i915 hard, but 
> perhaps I need to turn off lockdep.. Which would be a bummer since I 
> wanted to have a lockdep to check the next bit..

gem_shrink never has completed... It's not quite the right test for this
as we need to hit the shrinker & oom on both i915 and non-i915 paths to
try and hit all the different lock chains.

> >>> +                     mutex_lock_nested(&i915->drm.struct_mutex,
> >>> +                                       I915_MM_SHRINKER);
> >>
> >> _nested is needed since abandoning trylock would otherwise cause lockdep
> >> issues? But is it really safe? I don't know.. how can it be? It is
> >> guaranteed to be a different process here otherwise the result wouldn't
> >> be MUTEX_TRYLOCK_FAILED.
> > 
> > The easy option was to only force the mutex_lock for kswapd. However,
> > noting that we do need to forcibly shrink before oom, I opted for the
> > more inclusive attempt to take it in both. For the oom approach, my only
> > handwaving is we shouldn't be nested under oom serialisation and so
> > avoid an ABBA deadlock. It's survived some oom abuse, but the machine
> > still becomes unresponsive (but pingable) eventually (as it does before
> > the patch).
> 
> ...which I completely did not understand! :(
> 
> When you say kswapd - that means mutex->owner != current in the shrinker 
> and otherwise it is direct reclaim?
> 
> If true, how is the nested annotation correct for the kswapd case?

We have 3 paths,
direct-reclaim / oom from under i915 + struct_mutex
direct-reclaim / oom from elsewhere
kswapd

The nested annotation doesn't mean that the lock is nested per-se, it
just means that here it is acting as a different subclass of the lock.
Since the kswapd path can't have struct_mutex, it is safe from being
confused. The complication is if there is serialisation on
direct-reclaim (which I think can be run concurrently, i.e. no
mutex/semaphore nesting) or oom. For oom, there is a lock but afaict the
guard is just a trylock so shouldn't cause an ABBA deadlock if we oom
under struct_mutex while fighting an oom in another process.
-Chris