[Intel-gfx] [PATCH] drm/i915: Show stack (by WARN) for hitting forcewake errors

Tue Jul 24 10:59:07 UTC 2018

Quoting Tvrtko Ursulin (2018-07-24 11:33:47)
> 
> On 20/07/2018 12:11, Chris Wilson wrote:
> > On Sandybridge, we need a workaround to wait for the CPU thread to wake
> > up before we are sure that we have enabled the GT power well. However,
> > we do see the errors being reported and failed reads returning spurious
> > results. To try and capture more details as it fails, promote the error
> > into a WARN so we grab the stacktrace, and to try and reduce the
> > frequency of errors, increase the timeout from 500us to 5ms.
> > 
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/intel_uncore.c | 18 ++++++++++++++----
> >   1 file changed, 14 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > index b892ca8396e8..284be151f645 100644
> > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > @@ -283,14 +283,24 @@ fw_domains_reset(struct drm_i915_private *i915,
> >               fw_domain_reset(i915, d);
> >   }
> >   
> > +static inline u32 gt_thread_status(struct drm_i915_private *dev_priv)
> > +{
> > +     u32 val;
> > +
> > +     val = __raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG);
> > +     val &= GEN6_GT_THREAD_STATUS_CORE_MASK;
> > +
> > +     return val;
> > +}
> > +
> >   static void __gen6_gt_wait_for_thread_c0(struct drm_i915_private *dev_priv)
> >   {
> > -     /* w/a for a sporadic read returning 0 by waiting for the GT
> > +     /*
> > +      * w/a for a sporadic read returning 0 by waiting for the GT
> >        * thread to wake up.
> >        */
> > -     if (wait_for_atomic_us((__raw_i915_read32(dev_priv, GEN6_GT_THREAD_STATUS_REG) &
> > -                             GEN6_GT_THREAD_STATUS_CORE_MASK) == 0, 500))
> > -             DRM_ERROR("GT thread status wait timed out\n");
> > +     WARN_ONCE(wait_for_atomic_us(gt_thread_status(dev_priv) == 0, 5000),
> > +               "GT thread status wait timed out\n");
> >   }
> >   
> >   static void fw_domains_get_with_thread_status(struct drm_i915_private *dev_priv,
> > 
> 
> Quite a large increase - but I guess there is nothing we can do if it's
> hit, nor is it better to time out faster if the wait needs to be longer.

It was a bit of a jump, but I felt that if we are prematurely declaring
the machine dead, we end up killing it anyway.

A long, long time ago, when we saw these errors it was terminal. The
recent sightings in CI however looked transient (hard to tell as we
panicked shortly after), so I'm betting that a longer timeout will help
with the false positives (additional hardirq latency is better than death).
-Chris
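
For readers following along outside the kernel tree, the poll-then-warn-once
pattern discussed in this thread can be sketched in userspace C. This is only
an illustration under stated assumptions: `fake_gt`, `fake_gt_thread_status()`,
`fake_wait_for_thread_c0()` and the mask value are hypothetical stand-ins, not
the i915 helpers, and the loop is a simplified approximation of a bounded busy
wait rather than the actual implementation of `wait_for_atomic_us()`.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical mask; stands in for GEN6_GT_THREAD_STATUS_CORE_MASK. */
#define FAKE_THREAD_STATUS_CORE_MASK 0x7u

/* Simulated hardware: reports "busy" for a fixed number of reads. */
struct fake_gt {
	unsigned int busy_reads_left;
};

/* Read the simulated status register and keep only the core bits. */
static uint32_t fake_gt_thread_status(struct fake_gt *gt)
{
	uint32_t raw = 0xfffffff8u; /* unrelated bits set, to show the masking */

	assert(gt != NULL);
	if (gt->busy_reads_left > 0) {
		gt->busy_reads_left--;
		raw |= 0x1u;
	}
	return raw & FAKE_THREAD_STATUS_CORE_MASK;
}

/* Monotonic clock in microseconds. */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/*
 * Busy-wait until the status reads 0 or timeout_us elapses.
 * Returns true on timeout, and prints a warning only the first time.
 */
static bool fake_wait_for_thread_c0(struct fake_gt *gt, uint64_t timeout_us)
{
	static bool warned;
	uint64_t deadline = now_us() + timeout_us;

	for (;;) {
		/* Sample the clock first so the status is always checked
		 * at least once after the deadline test. */
		bool expired = now_us() >= deadline;

		if (fake_gt_thread_status(gt) == 0)
			return false;
		if (expired)
			break;
	}

	if (!warned) {
		warned = true;
		fprintf(stderr, "GT thread status wait timed out\n");
	}
	return true;
}
```

With this sketch, a device that goes idle after a few reads returns false well
inside either timeout, while one that never goes idle costs the caller the full
timeout before the single warning. That is the trade-off above: a longer budget
only adds latency on the paths that would otherwise have been reported as
failures.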