[Intel-gfx] [PATCH v2] drm/i915/selftests: Include the trace as a debug aide

Thu Mar 22 19:29:16 UTC 2018

On Thu, Mar 22, 2018 at 02:30:09PM +0000, Chris Wilson wrote:
> Quoting Mika Kuoppala (2018-03-22 14:26:41)
> > Chris Wilson <chris at chris-wilson.co.uk> writes:
> > 
> > > If we fail to reset the GPU in a timely fashion, dump the GEM trace so
> > > that we can see what operations were in flight when the GPU got stuck.
> > >
> > > v2: There's more than one timeout that deserves tracing!
> > > v3: Silence checkpatch by not even using a product at all!
> > >
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > ---
> > >  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 23 ++++++++++++++++++++---
> > >  1 file changed, 20 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > > index 4372826998aa..9b235dae8dd9 100644
> > > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > > @@ -260,8 +260,11 @@ static void wedge_me(struct work_struct *work)
> > >  {
> > >       struct wedge_me *w = container_of(work, typeof(*w), work.work);
> > >  
> > > -     pr_err("%pS timed out, cancelling all further testing.\n",
> > > -            w->symbol);
> > > +     pr_err("%pS timed out, cancelling all further testing.\n", w->symbol);
> > > +
> > > +     GEM_TRACE("%pS timed out.\n", w->symbol);
> > > +     GEM_TRACE_DUMP();
> > > +
> > >       i915_gem_set_wedged(w->i915);
> > >  }
> > >  
> > > @@ -621,9 +624,19 @@ static int active_engine(void *data)
> > >               mutex_unlock(&engine->i915->drm.struct_mutex);
> > >  
> > >               if (old) {
> > > -                     i915_request_wait(old, 0, MAX_SCHEDULE_TIMEOUT);
> > > +                     if (i915_request_wait(old, 0, HZ) < 0) {
> > > +                             GEM_TRACE("%s timed out.\n", engine->name);
> > > +                             GEM_TRACE_DUMP();
> > > +
> > > +                             i915_gem_set_wedged(engine->i915);
> > > +                             i915_request_put(old);
> > > +                             err = -EIO;
> > > +                             break;
> > > +                     }
> > 
> > Using err = i915_request_wait() could have saved one extra request_put
> > but I dunno if it would be any cleaner.
> 
> It's also -ETIME, which didn't fit my intention.
> 
> > 
> > >                       i915_request_put(old);
> > >               }
> > > +
> > > +             cond_resched();
> > 
> > To give more slack for other engines and main thread to proceed?
> 
> Yes. Otherwise, it spins mighty fine.
> > 
> > >       }
> > >  
> > >       for (count = 0; count < ARRAY_SIZE(rq); count++)
> > > @@ -1126,6 +1139,10 @@ int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
> > >  
> > >       err = i915_subtests(tests, i915);
> > >  
> > > +     mutex_lock(&i915->drm.struct_mutex);
> > > +     flush_test(i915, I915_WAIT_LOCKED);
> > > +     mutex_unlock(&i915->drm.struct_mutex);
> > > +
> > 
> > To wash out leftovers.
> 
> Yeah, from the early abort we left requests unaccounted for and needed
> to grab the struct_mutex to run retire. Otherwise we hit assertions
> later on about trying to unload the driver with requests still inflight.
> -Chris

On this and the 3 others in this series...

Reviewed-by: Jeff McGee <jeff.mcgee at intel.com>