[Intel-gfx] [PATCH] drm/i915: Prevent machine hang from Broxton's vtd w/a and error capture

Daniel Vetter daniel at ffwll.ch
Wed Dec 6 14:51:06 UTC 2017


On Wed, Dec 06, 2017 at 02:48:36PM +0000, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-12-06 14:43:39)
> > On Wed, Dec 06, 2017 at 02:19:03PM +0000, Chris Wilson wrote:
> > > Since capturing the error state requires fiddling around with the GGTT
> > > to read arbitrary buffers and is itself run under stop_machine(), it
> > > deadlocks the machine (effectively a hard hang) when run in conjunction
> > > with Broxton's VTd workaround to serialize GGTT access.
> > > 
> > > Fixes: 0ef34ad6222a ("drm/i915: Serialize GTT/Aperture accesses on BXT")
> > > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > > Cc: Jon Bloomfield <jon.bloomfield at intel.com>
> > > Cc: John Harrison <john.C.Harrison at intel.com>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gpu_error.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > index 48418fb81066..e6c7e8e53815 100644
> > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > @@ -1813,6 +1813,10 @@ void i915_capture_error_state(struct drm_i915_private *dev_priv,
> > >       if (!i915_modparams.error_capture)
> > >               return;
> > >  
> > > +     /* Prevent recursively calling stop_machine() and deadlocking. */
> > > +     if (intel_ggtt_update_needs_vtd_wa(dev_priv))
> > > +             return;
> > 
> > I'd put this closer to the stop machine, at the head of
> > i915_capture_gpu_state(). If the bogus debug output annoys then we could
> > switch that to a PTR_ERR return value I guess. But I guess this here is
> > ok too, so either way:
> 
> I was considering doing some of the capture, skipping the buffers, but
> nowadays those buffers tend to be the crux of triaging. My only real concern
> is how to explain to the user that the error state cannot exist, for
> which we could go and add -ENODEV to sysfs/debugfs just to be clear.

Fancy idea: store the PTR_ERR in ->first.error and return that? Would
address both my bikeshed and your suggestion.
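
To make that concrete, a rough userspace sketch of the idea, not the actual
driver code. Everything here is a stand-in: the ERR_PTR helpers mimic the
ones in linux/err.h, the structs and the two functions are hypothetical and
heavily reduced, and the slot is modeled as a field named first_error. The
capture path records an ERR_PTR when the VT-d workaround rules out capture,
and the read path (think sysfs/debugfs) hands that errno back to the user.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers (linux/err.h). */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, heavily reduced models of the driver structures. */
struct gpu_state { const char *text; };
struct gpu_error { struct gpu_state *first_error; };
struct device_model { int needs_vtd_wa; struct gpu_error gpu_error; };

/* Capture path: record why no state exists instead of returning silently. */
void capture_error_state(struct device_model *dev, struct gpu_state *captured)
{
	if (dev->gpu_error.first_error)
		return; /* only the first error (or error code) is kept */

	if (dev->needs_vtd_wa) {
		/* Capture would recurse into stop_machine(); say so. */
		dev->gpu_error.first_error = ERR_PTR(-ENODEV);
		return;
	}

	dev->gpu_error.first_error = captured;
}

/* Read path: return the stored errno, 0 if nothing was captured,
 * or the number of bytes copied into buf. */
long error_state_read(struct device_model *dev, char *buf, size_t len)
{
	struct gpu_state *state = dev->gpu_error.first_error;

	assert(buf && len > 0);

	if (IS_ERR(state))
		return PTR_ERR(state);

	if (!state) {
		buf[0] = '\0';
		return 0;
	}

	snprintf(buf, len, "%s", state->text);
	return (long)strlen(buf);
}
```

With needs_vtd_wa set, a read after a hang would return -ENODEV rather than
an empty file, so the user can tell "capture not possible" apart from "no
error yet". The check itself can then sit wherever is closest to the
stop_machine() call.
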
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch