[Intel-gfx] [PATCH v2 1/3] drm/i915: Fix eviction when the GGTT is idle but full

Chris Wilson chris at chris-wilson.co.uk
Wed Oct 11 12:12:52 UTC 2017


Quoting Tvrtko Ursulin (2017-10-11 13:05:02)
> 
> On 10/10/2017 22:38, Chris Wilson wrote:
> > In the full-ppgtt world, we can fill the GGTT full of context objects.
> > These context objects are currently implicitly tracked by the requests
> > that pin them i.e. they are only unpinned when the request is completed
> > and retired, but we do not have the link from the vma to the request
> > (anymore). In order to unpin those contexts, we have to issue another
> > request and wait upon the switch to the kernel context.
> > 
> > The bug during eviction was that we assumed that a full GGTT meant we
> > would have requests on the GGTT timeline, and so we missed situations
> > where those requests where merely in flight (and when even they have not
> > yet been submitted to hw yet). The fix employed here is to change the
> > already-is-idle test to no look at the execution timeline, but count the
> > outstanding requests and then check that we have switched to the kernel
> > context. Erring on the side of overkill here just means that we stall a
> > little longer than may be strictly required, but we only expect to hit
> > this path in extreme corner cases where returning an erroneous error is
> > worse than the delay.
> > 
> > v2: Logical inversion when swapping over branches.
> > 
> > Fixes: 80b204bce8f2 ("drm/i915: Enable multiple timelines")
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem_evict.c | 63 ++++++++++++++++++++++-------------
> >   1 file changed, 39 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c b/drivers/gpu/drm/i915/i915_gem_evict.c
> > index a5a5b7e6daae..ee4811ffb7aa 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_evict.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_evict.c
> > @@ -33,21 +33,20 @@
> >   #include "intel_drv.h"
> >   #include "i915_trace.h"
> >   
> > -static bool ggtt_is_idle(struct drm_i915_private *dev_priv)
> > +static bool ggtt_is_idle(struct drm_i915_private *i915)
> >   {
> > -     struct i915_ggtt *ggtt = &dev_priv->ggtt;
> > -     struct intel_engine_cs *engine;
> > -     enum intel_engine_id id;
> > +       struct intel_engine_cs *engine;
> > +       enum intel_engine_id id;
> >   
> > -     for_each_engine(engine, dev_priv, id) {
> > -             struct intel_timeline *tl;
> > +       if (i915->gt.active_requests)
> > +            return false;
> >   
> > -             tl = &ggtt->base.timeline.engine[engine->id];
> > -             if (i915_gem_active_isset(&tl->last_request))
> > -                     return false;
> > -     }
> > +       for_each_engine(engine, i915, id) {
> > +            if (engine->last_retired_context != i915->kernel_context)
> > +                    return false;
> > +       }
> >   
> > -     return true;
> > +       return true;
> >   }
> >   
> >   static int ggtt_flush(struct drm_i915_private *i915)
> > @@ -157,7 +156,8 @@ i915_gem_evict_something(struct i915_address_space *vm,
> >                                   min_size, alignment, cache_level,
> >                                   start, end, mode);
> >   
> > -     /* Retire before we search the active list. Although we have
> > +     /*
> > +      * Retire before we search the active list. Although we have
> >        * reasonable accuracy in our retirement lists, we may have
> >        * a stray pin (preventing eviction) that can only be resolved by
> >        * retiring.
> > @@ -182,7 +182,8 @@ i915_gem_evict_something(struct i915_address_space *vm,
> >               BUG_ON(ret);
> >       }
> >   
> > -     /* Can we unpin some objects such as idle hw contents,
> > +     /*
> > +      * Can we unpin some objects such as idle hw contents,
> >        * or pending flips? But since only the GGTT has global entries
> >        * such as scanouts, rinbuffers and contexts, we can skip the
> >        * purge when inspecting per-process local address spaces.
> > @@ -190,19 +191,33 @@ i915_gem_evict_something(struct i915_address_space *vm,
> >       if (!i915_is_ggtt(vm) || flags & PIN_NONBLOCK)
> >               return -ENOSPC;
> >   
> > -     if (ggtt_is_idle(dev_priv)) {
> > -             /* If we still have pending pageflip completions, drop
> > -              * back to userspace to give our workqueues time to
> > -              * acquire our locks and unpin the old scanouts.
> > -              */
> > -             return intel_has_pending_fb_unpin(dev_priv) ? -EAGAIN : -ENOSPC;
> > -     }
> > +     /*
> > +      * Not everything in the GGTT is tracked via VMA using
> > +      * i915_vma_move_to_active(), otherwise we could evict as required
> > +      * with minimal stalling. Instead we are forced to idle the GPU and
> > +      * explicitly retire outstanding requests which will then remove
> > +      * the pinning for active objects such as contexts and ring,
> > +      * enabling us to evict them on the next iteration.
> > +      *
> > +      * To ensure that all user contexts are evictable, we perform
> > +      * a switch to the perma-pinned kernel context. This all also gives
> > +      * us a termination condition, when the last retired context is
> > +      * the kernel's there is no more we can evict.
> > +      */
> > +     if (!ggtt_is_idle(dev_priv)) {
> > +             ret = ggtt_flush(dev_priv);
> > +             if (ret)
> > +                     return ret;
> >   
> > -     ret = ggtt_flush(dev_priv);
> > -     if (ret)
> > -             return ret;
> > +             goto search_again;
> > +     }
> >   
> > -     goto search_again;
> > +     /*
> > +      * If we still have pending pageflip completions, drop
> > +      * back to userspace to give our workqueues time to
> > +      * acquire our locks and unpin the old scanouts.
> > +      */
> > +     return intel_has_pending_fb_unpin(dev_priv) ? -EAGAIN : -ENOSPC;
> >   
> >   found:
> >       /* drm_mm doesn't allow any other other operations while
> > 
> 
> Looks like it will fix the bug and can't spot that it introduces a 
> problem. Was there an igt which was failing or any bugzilla?

No, there was an odd flip during shard testing of hugepages. Papered
over before we landed the hugepages work, but still deserved a real fix
for the bug uncovered.

I didn't succeed in writing an igt that didn't take too long or run out
of memory (8G ram trying to fill a 4G GGTT, we have some overcommit
issues ;). So I resorted to making the selftest instead, where we can
pretend like the GGTT is only a few megabytes to make the testing
tractable.
-Chris


More information about the Intel-gfx mailing list