[Intel-gfx] [PATCH] drm/i915: Treat i915_reset_engine() as guilty until proven innocent

Fri Apr 6 21:30:41 UTC 2018

Quoting Michel Thierry (2018-04-06 22:23:21)
> And I thought we believed in presumption of innocence...
> 
> On 4/6/2018 2:00 PM, Chris Wilson wrote:
> > If we are resetting just one engine, we know it has stalled. So we can
> > pass the stalled parameter directly to i915_gem_reset_engine(), which
> > alleviates the necessity to poke at the generic engine->hangcheck.stalled
> > magic variable, leaving that under control of hangcheck as its name
> > implies. Other than simplifying by removing the indirect parameter along
> > this path, this allows us to introduce new reset mechanisms that run
> > independently of hangcheck.
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Michel Thierry <michel.thierry at intel.com>
> > Cc: Jeff McGee <jeff.mcgee at intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_drv.c               |  2 +-
> >   drivers/gpu/drm/i915/i915_drv.h               |  3 +-
> >   drivers/gpu/drm/i915/i915_gem.c               | 36 +++++++++----------
> >   .../gpu/drm/i915/selftests/intel_hangcheck.c  |  9 -----
> >   4 files changed, 20 insertions(+), 30 deletions(-)
> > 
> ...
> > @@ -774,7 +766,6 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
> >                               break;
> >                       }
> >   
> > -                     engine->hangcheck.stalled = false;
> >                       count++;
> >   
> >                       if (rq) {
> > 
> 
> Are the ones in igt_handle_error() still needed?
>    hangcheck.stalled = true;
>    hangcheck.seqno = intel_engine_get_seqno(engine);
> 
> Because igt_handle_error is sending a real request.

> (I think the only ones remaining in the selftest should be in 
> fake_hangcheck).

Right, fake_hangcheck definitely still needs it to behave like
hangcheck.

i915_handle_error() is still "odd". At the moment, yes we still need to
be poking where we shouldn't. If i915_handle_error() uses the
engine_mask to do per-engine resets, no, we don't need the magic
hangcheck.stalled. But, if it falls back to the full device level, we
loose the guilty reset.  So we do get a difference in behaviour, that
really hasn't been noticed before as the only real caller is from
hangcheck. (i915_wedged, I dare anyone to say what they expect ;)

I think the answer will be to pass engine_mask to i915_reset. But I
haven't fleshed that out yet. I think it means we do away with
hangcheck.seqno as well, so bonus?
-Chris