[Intel-gfx] [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests

Tue Sep 26 13:03:11 UTC 2017

Quoting Mika Kuoppala (2017-09-26 13:48:17)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > If we see the seqno stop progressing, we abandon the test for fear that
> > the GPU died following the reset. However, during test teardown we still
> > wait for the GPU to idle before continuing, but we have already
> > confirmed that the GPU is dead. Furthermore, since we are inside a reset
> > test, we have disabled the hangchecker, and so there is no safety net and
> > we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> > state of emergency so we can escape.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Jari Tahvanainen <jari.tahvanainen at intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> >  1 file changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > index 02e52a146ed8..913fe752f6b4 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> >               *batch++ = lower_32_bits(vma->node.start);
> >       }
> >       *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> > +     wmb();
> >
> 
> Why not the big hammer with i915_gem_chipset_flush() here?

It didn't cross my mind, I was just doodling :)

> 
> >       flags = 0;
> >       if (INTEL_GEN(vm->i915) <= 5)
> > @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> >       __i915_add_request(rq, true);
> >  
> >       if (!wait_for_hang(&h, rq)) {
> > -             pr_err("Failed to start request %x\n", rq->fence.seqno);
> > +             pr_err("Failed to start request %x, at %x\n",
> > +                    rq->fence.seqno, hws_seqno(&h, rq));
> > +
> > +             i915_reset(i915, 0);
> > +             i915_gem_set_wedged(i915);
> > +
> >               err = -EIO;
> >               goto out_rq;
> >       }
> > @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> >                       __i915_add_request(rq, true);
> >  
> >                       if (!wait_for_hang(&h, prev)) {
> > -                             pr_err("Failed to start request %x\n",
> > -                                    prev->fence.seqno);
> > +                             pr_err("Failed to start request %x, at %x\n",
> > +                                    rq->fence.seqno, hws_seqno(&h, rq));
> 
> As you pointed out, the debug message here is for the wrong request.
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>

Happy if I drop the wmb() for a later patch and replace it with a
chipset flush instead?
-Chris
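
For anyone reading along, the escape described in the commit message (notice that the seqno has stopped advancing, then declare the device wedged rather than wait forever) can be modelled in plain userspace C. This is only a rough sketch, not the i915 code: struct fake_gpu, fake_gpu_poll(), wait_for_idle_or_wedge() and the max_stalled threshold are all hypothetical names invented for illustration, and seqno wraparound is ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the device; not the i915 structures. */
struct fake_gpu {
	uint32_t seqno;    /* last seqno the fake hardware reported */
	uint32_t stall_at; /* seqno at which the fake hardware stops advancing */
	bool wedged;       /* set once we give up on the device */
};

/* One poll of the fake hardware: advance by one unless it has stalled. */
static uint32_t fake_gpu_poll(struct fake_gpu *gpu)
{
	if (gpu->seqno < gpu->stall_at)
		gpu->seqno++;
	return gpu->seqno;
}

/*
 * Wait for the seqno to reach target. With no hangcheck running, nothing
 * else will rescue us, so count consecutive polls without progress and
 * mark the device wedged once max_stalled of them have been seen.
 * Returns 0 on completion, -EIO if the device was declared wedged.
 */
static int wait_for_idle_or_wedge(struct fake_gpu *gpu, uint32_t target,
				  unsigned int max_stalled)
{
	unsigned int stalled = 0;
	uint32_t last = gpu->seqno;

	assert(max_stalled > 0);

	while (!gpu->wedged) {
		uint32_t now = fake_gpu_poll(gpu);

		if (now >= target)
			return 0;

		if (now != last) {
			/* Progress was made: reset the stall counter. */
			last = now;
			stalled = 0;
		} else if (++stalled >= max_stalled) {
			/* No progress for too long: declare an emergency. */
			gpu->wedged = true;
		}
	}

	return -EIO;
}
```

In this model a device whose stall_at is below the target ends up wedged and the wait returns -EIO, while a healthy one returns 0 with wedged still false. The point is just that the wait carries its own bound when the usual safety net is switched off.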