[Intel-gfx] [PATCH] drm/i915: kicking rings considered harmful

Tue Sep 27 21:38:59 CEST 2011

On Tue, 27 Sep 2011 20:03:17 +0200
Daniel Vetter <daniel at ffwll.ch> wrote:

> On Tue, Sep 27, 2011 at 06:31:59PM +0100, Chris Wilson wrote:
> > On Tue, 27 Sep 2011 09:46:14 -0700, Ben Widawsky <ben at bwidawsk.net> wrote:
> > > On Tue, 27 Sep 2011 12:03:22 +0200
> > > Daniel Vetter <daniel at ffwll.ch> wrote:
> > > 
> > > > On Mon, Sep 26, 2011 at 10:22:01PM -0700, Ben Widawsky wrote:
> > > > > On Mon, 26 Sep 2011 19:59:50 +0200
> > > > > Daniel Vetter <daniel.vetter at ffwll.ch> wrote:
> > > > > > > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > > > > > > index da5d607..09c11e4 100644
> > > > > > > --- a/drivers/gpu/drm/i915/i915_irq.c
> > > > > > > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > > > > > > @@ -1694,7 +1694,7 @@ void i915_hangcheck_elapsed(unsigned long data)
> > > > > > >  		if (dev_priv->hangcheck_count++ > 1) {
> > > > > > >  			DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
> > > > > > >  
> > > > > > > -			if (!IS_GEN2(dev)) {
> > > > > > > +			if (!IS_GEN2(dev) && i915_try_reset) {
> > > > > > >  				/* Is the chip hanging on a WAIT_FOR_EVENT?
> > > > > > >  				 * If so we can simply poke the RB_WAIT bit
> > > > > > >  				 * and break the hang. This should work on
> > > > > 
> > > > > I think you should also be able to accomplish the same thing
> > > > > with enable_hangcheck param. I had the same problem with the
> > > > > debugger :)
> > > > 
> > > > I agree. Iirc you have some patches floating in that area to make the
> > > > hangcheck a bit more robust. Can you maybe add this to that series and
> > > > (re-)submit?
> > > > 
> > > > Cheers, Daniel
> > > 
> > > While 9/10 times daniel > ben, I'm playing my 10% card here and
> > > suggesting that mixing the reset variable and ring kick is not the right
> > > way to go about this.
> > 
> > One purpose of the i915.reset parameter is to disable any automatic
> > attempts to recover from a hang condition so that the error state is not
> > misleading. So preventing the kick ring does help in that regard.
> > 
> > A second purpose is to prevent i915_reset() from causing havoc and hanging
> > the machine. Daniel is implying that kicking the rings is instrumental in
> > making matters worse. Again using i915.reset to prevent kicking the rings
> > fits in with that purpose.
> > 
> > Since I regard kicking rings as a form of reset, I don't see it as a
> > conflation of terms and so a valid use of i915.reset.
> 
> Couldn't have said it any better. The bad effect of kicking stuck rings
> is mostly that when we have a sync problem there's a decent chance
> somebody has written garbage into our batchbuffers. Continuously trying to
> execute said garbage is just tempting fate in the gpu's error resilience.
> -Daniel

If we do this we lose the ability to kick the rings without resetting the
GPU (not that I find that terribly useful). If we do this, it does fire a
wq event, but I don't see a problem with that for this case.

I think I would rather do this:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 012732b..803524e 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1698,6 +1698,10 @@ void i915_hangcheck_elapsed(unsigned long data)
                if (dev_priv->hangcheck_count++ > 1) {
                        DRM_ERROR("Hangcheck timer elapsed... GPU hung\n");
 
+                       /* Save off error state before kicking the rings and
+                        * possibly ruining the GPU state.
+                        */
+                       i915_handle_error(dev, true);
                        if (!IS_GEN2(dev)) {
                                /* Is the chip hanging on a WAIT_FOR_EVENT?
                                 * If so we can simply poke the RB_WAIT bit
@@ -1717,7 +1721,6 @@ void i915_hangcheck_elapsed(unsigned long data)
                                        goto repeat;
                        }
 
-                       i915_handle_error(dev, true);
                        return;
                }
        } else {
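
To make the intended ordering explicit, here is a rough userspace model of
the hung path after the diff above: capture the error state first, then try
the ring kick. It is only a sketch. struct fake_gpu, fake_handle_error(),
fake_kick_ring() and fake_hangcheck_elapsed() are hypothetical stand-ins,
not driver code, and the model leaves out the progress check, the timer
re-arm and the "goto repeat" path.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; none of this is real i915 code. */
enum hang_event { EV_CAPTURE_ERROR, EV_KICK_RING };

struct fake_gpu {
	int hangcheck_count;     /* models dev_priv->hangcheck_count */
	bool is_gen2;            /* models IS_GEN2(dev) */
	bool ring_waiting;       /* models a ring stuck on WAIT_FOR_EVENT */
	enum hang_event log[4];  /* order in which the actions happened */
	size_t nlog;
};

static void record(struct fake_gpu *gpu, enum hang_event ev)
{
	if (gpu->nlog < sizeof(gpu->log) / sizeof(gpu->log[0]))
		gpu->log[gpu->nlog++] = ev;
}

/* Stands in for i915_handle_error(dev, true): save off the error state. */
static void fake_handle_error(struct fake_gpu *gpu)
{
	record(gpu, EV_CAPTURE_ERROR);
}

/* Stands in for poking the RB_WAIT bit; returns true if a ring was kicked. */
static bool fake_kick_ring(struct fake_gpu *gpu)
{
	if (!gpu->ring_waiting)
		return false;
	gpu->ring_waiting = false;
	record(gpu, EV_KICK_RING);
	return true;
}

/* Returns true once the hang threshold is crossed (assumes no progress). */
static bool fake_hangcheck_elapsed(struct fake_gpu *gpu)
{
	if (gpu->hangcheck_count++ > 1) {
		/* Capture before the kick can disturb the GPU state. */
		fake_handle_error(gpu);
		if (!gpu->is_gen2)
			fake_kick_ring(gpu);
		return true;
	}
	return false;
}
```

With ring_waiting set, the third call to fake_hangcheck_elapsed() crosses the
threshold and the log shows the capture ahead of the kick, which is the whole
point of moving the i915_handle_error() call.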