[Intel-gfx] [PATCH 2/3] drm/i915: Return immediately if trylock fails for direct-reclaim

Thu Nov 8 16:48:05 UTC 2018

Quoting Tvrtko Ursulin (2018-11-08 16:23:08)
> 
> On 08/11/2018 08:17, Chris Wilson wrote:
> > Ignore trying to shrink from i915 if we fail to acquire the struct_mutex
> > in the shrinker while performing direct-reclaim. The trade-off being
> > (much) lower latency for non-i915 clients at an increased risk of being
> > unable to obtain a page from direct-reclaim without hitting the
> > oom-notifier. The proviso being that we still keep trying to hard
> > obtain the lock for oom so that we can reap under heavy memory pressure.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem_shrinker.c | 24 +++++++++++-------------
> >   1 file changed, 11 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_shrinker.c b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> > index ea90d3a0d511..d461f458f4af 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
> > @@ -36,7 +36,9 @@
> >   #include "i915_drv.h"
> >   #include "i915_trace.h"
> >   
> > -static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> > +static bool shrinker_lock(struct drm_i915_private *i915,
> > +                       unsigned int flags,
> > +                       bool *unlock)
> >   {
> >       switch (mutex_trylock_recursive(&i915->drm.struct_mutex)) {
> >       case MUTEX_TRYLOCK_RECURSIVE:
> > @@ -45,15 +47,11 @@ static bool shrinker_lock(struct drm_i915_private *i915, bool *unlock)
> >   
> >       case MUTEX_TRYLOCK_FAILED:
> >               *unlock = false;
> > -             preempt_disable();
> > -             do {
> > -                     cpu_relax();
> > -                     if (mutex_trylock(&i915->drm.struct_mutex)) {
> > -                             *unlock = true;
> > -                             break;
> > -                     }
> > -             } while (!need_resched());
> > -             preempt_enable();
> > +             if (flags & I915_SHRINK_ACTIVE) {
> 
> So until I915_SHRINK_ACTIVE, which is the last ditch attempt to shrink 
> in the normal case (direct reclaim?) or oom, we bail out on the first 
> sign of struct mutex contention. Doesn't this make our shrinker much 
> less effective at runtime and why is that OK?

As I said, it's a tradeoff between blocking others for _several_
_seconds_ and making no progress and returning immediately and making no
progress. My argument is along the lines of if direct-reclaim is running
in another process and something else is engaged in the driver hopefully
the driver will be cleaning up as it goes along or else what remains is
active and won't be reaped anyway. If direct reclaim is failing, the
delay before trying the oom path is insignificant.

> Or in other words, for what use cases, tests or benchmark was the 
> existing approach of busy looping a problem?

Do something like 'find / -exec cat' while running i915 and see how long
you have to wait for a khungtaskd :|

gem_syslatency reports max latencies of over 600s, and I'm sure it's
pretty much unbounded. It's bad enough that I expect we are the cause of
significant hitching (mainly in other tasks) on the desktop running at
memory capacity.

> > +                     mutex_lock_nested(&i915->drm.struct_mutex,
> > +                                       I915_MM_SHRINKER);
> 
> _nested is needed since abandoning trylock would otherwise cause lockdep 
> issues? But is it really safe? I don't know.. how can it be? It is 
> guaranteed to be a different process here otherwise the result wouldn't 
> be MUTEX_TRYLOCK_FAILED.

The easy option was to only force the mutex_lock for kswapd. However,
noting that we do need to forcibly shrink before oom, I opted for the
more inclusive attempt to take it in both. For the oom approach, my only
handwaving is we shouldn't be nested under oom serialisation and so
avoid an ABBA deadlock. It's survived some oom abuse, but the machine
still becomes unresponsive (but pingable) eventually (as it does before
the patch).

> Also, I915_MM_SHRINKER is so far only documented to apply to obj->mm.lock.

It suits very nicely as it being applied for the same purpose.
-Chris