[Intel-gfx] [PATCH v2] drm/i915: Sleep and retry a GPU reset if at first we don't succeed

Chris Wilson chris at chris-wilson.co.uk
Fri Dec 1 13:33:12 UTC 2017


Quoting Mika Kuoppala (2017-12-01 13:20:29)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > As we declare the GPU wedged if the reset fails, such a failure is quite
> > terminal. Before taking that drastic action, let's sleep first and try
> > active, in the hope that the hardware has quietened down and is then
> > able to reset. After a few such attempts, it is fair to say that the HW
> > is truly wedged.
> >
> > v2: Always print the failure message now, we precheck whether resets are
> > disabled.
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=104007
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_drv.c | 20 +++++++++++++++-----
> >  1 file changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index e0f053f9c186..7faf20aff25a 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -1877,7 +1877,9 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
> >  {
> >       struct i915_gpu_error *error = &i915->gpu_error;
> >       int ret;
> > +     int i;
> >  
> > +     might_sleep();
> >       lockdep_assert_held(&i915->drm.struct_mutex);
> >       GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
> >  
> > @@ -1900,12 +1902,20 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
> >               goto error;
> >       }
> >  
> > -     ret = intel_gpu_reset(i915, ALL_ENGINES);
> > +     if (!intel_has_gpu_reset(i915)) {
> > +             DRM_DEBUG_DRIVER("GPU reset disabled\n");
> > +             goto error;
> > +     }
> > +
> > +     for (i = 0; i < 3; i++) {
> > +             ret = intel_gpu_reset(i915, ALL_ENGINES);
> > +             if (ret == 0)
> > +                     break;
> > +
> > +             msleep(100);
> 
> Seems reasonable to try few times and pause between defibrillate
> attempts instead of throwing dirt on top of coffin right
> off the bat.
> 
> Also I have been pondering that should we add a minicheck
> to intel_gpu_reset to poke that the gpu is really there.
> Like doing few nops in (render)ringbuffer and see if head
> moves before declaring it as a reset success?

That's kind of what the recovery tests. It's tricky to do so because we
would have to unwind hw state after the trial, plus we may not see a
problem until it tries to do something like write to HWSP.

> Not that we would not see it in init right after but just
> to have more precise location of failure instead of
> initing a dead gpu.

It's dead, Jim. Either way, the location should be in the post-mortem
error state capture before the first attempt to reset. As always, if
that isn't enough to diagnose the problem, we need to work harder to get
the right information into that dump. Tricky.
-Chris


More information about the Intel-gfx mailing list