[Intel-gfx] [PATCH] drm/i915: Fix up -EIO ABI per igt/gem_eio

Thu Nov 26 00:30:37 PST 2015

On Wed, Nov 25, 2015 at 05:52:57PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 06:45:26PM +0100, Daniel Vetter wrote:
> > My apologies to Chris Wilson for being dense on this topic for
> > essentially for years.
> > 
> > This patch doesn't do any big redesign of the overall reset flow, but
> > instead just applies changes where it's needed to obey gem_eio. We can
> > redesign later on, but for now I prefer to make the testcase happy
> > with some minimally invasive changes:
> > 
> > - Ensure that set_domain_ioctl never returns -EIO. Tricky part there
> >   is that we might race with the reset handler when doing the lockless
> >   wait.
> > 
> > - Make sure debugfs/i915_wedged is actually useful to reset a wedged
> >   gpu - atm it bails out with -EAGAIN for a terminally wedged gpu.
> >   That's because reset_in_progress actually includes that terminal
> >   state, which is super-confusing (and most callers got this wrong).
> >   Fix this by changing the semantics of reset_in_progress to do what
> >   it says on the tin (and fixup all other callers).
> > 
> >   Also make sure that reset_in_progress is cleared when we reach the
> >   terminal "wedged" state. Without this we kept returning -EAGAIN in
> >   some places forever.
> > 
> > - Second problem with debugfs/i915_wedge was that we never got out of
> >   the terminal wedged state. Hence manually set the reset counter out
> >   of that state before starting the hung gpu recovery.
> > 
> >   For safety also make sure that we are consintent with resetting the
> >   gpu between i915_reset_and_wakeup and i915_handler_error by just
> >   passing the boolean "wedged" around instead of trying to recompute
> >   it wrongly. This isn't an issue for gem_eio with the debugfs fix,
> >   but just general robustness of the code.
> > 
> > - Finally make sure that when gpu reset is disabled through the module
> >   paramter the kernel doesn't generate dmesg noise that would upset
> >   our CI/piglit. Simplest solution for that was to lift the i915.reset
> >   check up into i915_reset().
> > 
> > With all these changes, plus the igt fixes I've posted, gem_eio is now
> > happy on many snb.
> > 
> > v2: Don't upset lockdep with my set_domain_ioctl changes.
> > 
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Testcase: igt/gem_eio/*
> > Signed-off-by: Daniel Vetter <daniel.vetter at intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_debugfs.c |  4 ++++
> >  drivers/gpu/drm/i915/i915_drv.c     |  6 ++++++
> >  drivers/gpu/drm/i915/i915_drv.h     |  2 +-
> >  drivers/gpu/drm/i915/i915_gem.c     | 23 +++++++++++++++--------
> >  drivers/gpu/drm/i915/i915_irq.c     | 11 ++++++-----
> >  drivers/gpu/drm/i915/intel_uncore.c |  3 ---
> >  6 files changed, 32 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index 411a9c68b4ee..c4006c09ef1b 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -4744,6 +4744,10 @@ i915_wedged_set(void *data, u64 val)
> >  	if (i915_reset_in_progress(&dev_priv->gpu_error))
> >  		return -EAGAIN;
> >  
> > +	/* Already wedged, force us out of this terminal state. */
> > +	if (i915_terminally_wedged(&dev_priv->gpu_error))
> > +		atomic_set(&dev_priv->gpu_error.reset_counter, 0);
> 
> What?

At least with the current reset flow we won't get out of terminally
wedged. We could instead get out of the terminal state when we attempt
another gpu reset, but that kinda defeats the idea of wedge being
terminally. Hence why I placed this here.

> > +
> >  	intel_runtime_pm_get(dev_priv);
> >  
> >  	i915_handle_error(dev, val,
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index ec1e877c4a36..1f5386bb78f4 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -909,6 +909,12 @@ int i915_reset(struct drm_device *dev)
> >  
> >  	simulated = dev_priv->gpu_error.stop_rings != 0;
> >  
> > +	if (!i915.reset) {
> > +		DRM_INFO("GPU reset disabled in module option, not resetting\n");
> > +		mutex_unlock(&dev->struct_mutex);
> > +		return -ENODEV;
> > +	}
> > +
> >  	ret = intel_gpu_reset(dev);
> 
> Please don't break intel_gpu_reset() like that.

It's an equivalent transformation, the only thing that changed is that we
check i915.reset 2 function calls up in the callchain. What exactly is
broken?

> I think this patch missed the point entirely.

It gets gem_eio going, and I'm not sold on reworking the overall reset
flow. First fix up the ABI, then fix up any remaining issues with the
reset flow. Or what exactly do you mean?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch