[Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

Chris Wilson chris at chris-wilson.co.uk
Wed May 1 15:01:02 CEST 2013


On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> >> > If reset fails, the GPU is declared wedged. This ideally should never
> >> > happen, but very rarely it does. After the GPU is declared wedged, we
> >> > must allow userspace to continue to use its mapping of bo in order to
> >> > recover its data (and in some cases in order for memory management to
> >> > continue unabated). Obviously after the GPU is wedged, no bo are
> >> > currently accessed by the GPU and so we can complete any waits or domain
> >> > transitions away from the GPU. Currently, we fail this essential task
> >> > and instead report EIO and send a SIGBUS to the affected process -
> >> > causing major loss of data (by killing X or compiz).
> >> >
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> >> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >>
> >> So I've read again through the reset code and I still don't see how
> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
> >> -EIO eating after wait_rendering looks really suspicious to me.
> >
> > That's always been a layer of defense against the driver. Alternatively,
> > we can throw a warn into the wait, as we shouldn't enter there with a
> > seqno and a wedged GPU.
> 
> Yeah, that sounds like a good idea. We really shouldn't ever have an
> outstanding request while the gpu is wedged (ignoring dangerous changes
> to the wedged state through debugfs).

Except it is only true for a locked wait. :(
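
Something along those lines at the top of the locked wait, say in
i915_wait_seqno() (placement and exact condition here are only a
sketch, using the helpers we have today):

	struct drm_i915_private *dev_priv = ring->dev->dev_private;

	/* Sketch, not part of this patch: a locked wait should never
	 * see an unretired seqno on a terminally wedged GPU, so make
	 * some noise if it ever does. An unlocked wait can race with
	 * the reset handler, hence no warn on that path. */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		WARN_ON(!i915_seqno_passed(ring->get_seqno(ring, false),
					   seqno));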

> >> Now the other thing is i915_gem_object_wait_rendering, that thing
> >> loves to throw an -EIO at us. And on a quick check your patch misses
> >> the one in set_domain_ioctl. We probably need to do the same with
> >> sw_finish_ioctl. So what about an i915_mutex_lock_interruptible_no_EIO
> >> or similar to explicitly annotate the few places we don't want to hear
> >> about a dead gpu?
> >
> > In fact, it turns out that we hardly ever want
> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> > that EIO was only ever returned when we tried to issue a command on the
> > GPU whilst it was terminally wedged - in order to preserve data
> > integrity.
> 
> Don't we need to re-add a check in execbuf then? Otherwise I guess
> userspace has no idea ever that something is amiss ...

It will catch the EIO as soon as it tries to emit the command, but
adding the explicit check will save a ton of work in reserving buffers.
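
Something like an early bail at the top of i915_gem_do_execbuffer(),
before we start reserving anything, would do (exact placement is only
a sketch):

	/* Report the terminally wedged GPU up front instead of after
	 * paying for reserving and relocating the whole buffer list. */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return -EIO;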
 
> Otherwise I think this approach would also make sense, and feels more
> future-proof. Applications that want real robustness simply need to
> query the kernel through the new interface whether anything ugly has
> happened to them. And if we want more synchronous feedback we can
> always improve wait/set_domain_ioctl to check whether the given bo
> suffered through a little calamity.

The only explicit point we have to remember is to preserve the
throttle() semantics of reporting EIO when wedged.
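
That is, with wait_for_error() no longer reporting the wedged state,
the throttle path needs its own check, roughly (sketch only, sitting
in i915_gem_ring_throttle()):

	/* Preserve the old userspace-visible contract: throttle is the
	 * one ioctl that still reports a terminally wedged GPU as -EIO. */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return -EIO;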
 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 2bd8d7a..f243e32 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >
> >         /* GPU is already declared terminally dead, give up. */
> >         if (i915_terminally_wedged(error))
> > -               return -EIO;
> > +               return 0;
> >
> >         /*
> >          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >                                                10*HZ);
> >         if (ret == 0) {
> >                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> > -               return -EIO;
> > -       } else if (ret < 0) {
> > -               return ret;
> > -       }
> > +               atomic_set(&error->reset_counter, I915_WEDGED);
> > +       } else if (ret > 0)
> > +               ret = 0;
> 
> Less convinced about this hunk here. I think the right approach would
> be to simply kill the timeout. Unlucky users can always get out of
> here with a signal, and it's imo more robust if we have all relevant
> timeouts for the reset process in the reset work. Separate patch
> though. Also we now have quite good coverage of all the reset handling
> in igt, so I don't think we need to go overboard in defending against
> our own logic screw-ups ... Reset really should work nowadays.

Except that #63291 said otherwise (iirc)... I think several layers of
paranoia over handling GPU and driver hangs are justified.
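
For reference, killing the timeout as you suggest would reduce the
wait above to roughly the following (sketch only, relying on
i915_reset_in_progress() as it stands today):

	/* No timeout: rely on a signal to break the interruptible wait
	 * if the reset worker never wakes us up. */
	ret = wait_event_interruptible(error->reset_queue,
				       !i915_reset_in_progress(error));
	if (ret)
		return ret;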
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


