[Intel-gfx] [PATCH] drm/i915: kicking rings considered harmful

Wed Sep 28 03:34:31 CEST 2011

On Tue, 27 Sep 2011 22:54:01 +0100
Chris Wilson <chris at chris-wilson.co.uk> wrote:

> On Tue, 27 Sep 2011 12:38:59 -0700, Ben Widawsky <ben at bwidawsk.net> wrote:
> > If we do this we lose the possibility to kick rings, but not reset the
> > GPU (not that I find that terribly useful. If we do this, it does fire a
> > wq event, but I don't see a problem with that for this case.
> > 
> > I think I would rather do this:
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index 012732b..803524e 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1698,6 +1698,10 @@ void i915_hangcheck_elapsed(unsigned long data)
> >                 if (dev_priv->hangcheck_count++ > 1) {
> >                         DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
> >  
> > +                       /* Save off error state before kicking the rings and
> > +                        * possibly ruining the GPU state.
> > +                        */
> > +                       i915_handle_error(dev, true);
> >                         if (!IS_GEN2(dev)) {
> >                                 /* Is the chip hanging on a WAIT_FOR_EVENT?
> >                                  * If so we can simply poke the RB_WAIT bit
> > @@ -1717,7 +1721,6 @@ void i915_hangcheck_elapsed(unsigned long data)
> >                                         goto repeat;
> >                         }
> >  
> > -                       i915_handle_error(dev, true);
> >                         return;
> >                 }
> >         } else {
> 
> Interesting, if we simply call i915_capture_error_state() rather than move
> the i195_handle_error() earlier we do in fact get the best of both worlds.

We can do this except i915_handle_error() is called i915_driver_irq_handler, so
we have to modify that as well. But yeah, I'm fine with that too, though I
don't think it makes much difference either way.

> 
> However, it doesn't address Daniel's statement that kick_rings() provoked
> an unrecoverable hang and so we still need to disable that in order to
> save the error-state. The origin of ring-kicking was to try and recover
> from the modesetting/vsync issues, which apart from the outstanding issue
> in intel_crtc_disable() are behind us. (I hope ;-) We shouldn't be relying
> on i915_reset() and i915.reset=0 tends to be either deliberate or an act of
> desparation so I don't see the issue in also preventing ring-kicking with
> the same parameter. Is there an issue I'm overlooking?

No issue, I just feel that this is redundant to hangcheck_enable, so to me at
least, this just adds extra confusion to an already confusing situation. I seem
to be in the minority though.

To me it's:
reset=0, don't ever try to reset
enable_hangcheck=0, don't ever check if we're hung (ie. don't reset or kick)

And now it's
reset=0, don't every try to reset or kick
enabled_hangcheck=0, don't ever check if we're hung (ie. don't reset or kick)

I'd definitely be in favo(u)r of removing the kick_ring() if it isn't really
useful anymore. It has some forcewake race if I remember correctly which I
never bothered to fix.

> -Chris
> 

Ben