[Intel-gfx] [PATCH] drm/i915: Fix spurious -EIO/SIGBUS on wedged gpus

Tue May 28 11:22:20 CEST 2013

On Fri, May 24, 2013 at 10:03:14PM +0100, Chris Wilson wrote:
> On Fri, May 24, 2013 at 09:29:32PM +0200, Daniel Vetter wrote:
> > Chris Wilson noticed that since
> > 
> > commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9]
> > Author: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Date:   Thu Nov 15 17:17:22 2012 +0100
> > 
> >     drm/i915: clear up wedged transitions
> > 
> > X can again get -EIO when it does not expect it. And even worse score
> > a SIGBUS when accessing gtt mmaps. The established ABI is that we
> > _only_ return an -EIO from execbuf - all other ioctls should just
> > work. And since the reset code moves all bos out of gpu domains and
> > clears out all the last_seqno/ring tracking there really shouldn't be
> > any reason for non-execbuf code to ever touch the hw and see an -EIO.
> > 
> > After some extensive discussions we've noticed that these spurious -EIO
> > are caused by i915_gem_wait_for_error:
> > 
> > http://www.mail-archive.com/intel-gfx@lists.freedesktop.org/msg20540.html
> > 
> > That is easy to fix by returning 0 instead of -EIO, since grabbing the
> > dev->struct_mutex does not yet mean that we actually want to touch the
> > hw. And so there is no reason at all to fail with -EIO.
> > 
> > But that's not the entire story, since often (at least it's easily
> > googleable) dmesg indicates that the reset fails and we declare the
> > gpu wedged. Then, quite a bit later X wakes up with the "Timed out
> > waiting for the gpu reset to complete" DRM_ERROR message in
> > wait_for_error and brings down the desktop with an -EIO/SIGBUS.
> > 
> > So clearly we're missing a wakeup somewhere, since the gpu reset just
> > doesn't take 10 seconds to complete. And indeed we do handle the
> > terminally wedged state wrong.
> > 
> > Fix this all up.
> > 
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Daniel Vetter <daniel.vetter at ffwll.ch>
> > Cc: Damien Lespiau <damien.lespiau at intel.com>
> > Cc: stable at vger.kernel.org
> > Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> 
> Definite woosh. I feel silly for missing that.
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>

Merged to -fixes, thanks for the review.

> I still think there is a risk for the non-blocking wait to return an
> EIO and papering it over is the simplest approach. The chance that
> anyone will ever hit it is minimal, and fortunately an EIO should never
> actually cause an application with adequate error handling to crash, so
> something that we can discuss at leisure.

Yeah, now that we have a less hand-wavey explanation for those -EIO we can
forget about the reset timeout until the next user screams about X dying
untimely ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
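
For readers following the thread, the behaviour described in the commit
message can be modelled roughly as below. This is only a sketch under stated
assumptions, not the merged patch: the enum, the function names and the use of
-EAGAIN as a stand-in for "keep sleeping" are all hypothetical and do not
exist in i915. It shows the three points made above: a terminally wedged gpu
must end the wait (the missing wakeup), the check done before taking
dev->struct_mutex returns 0 rather than -EIO, and only execbuf reports -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the reset state; not the driver's real data structure. */
enum sketch_gpu_state {
    SKETCH_GPU_OK,
    SKETCH_GPU_RESET_IN_PROGRESS,
    SKETCH_GPU_WEDGED            /* reset failed, gpu declared terminally wedged */
};

/*
 * Exit condition for anyone sleeping on the reset. A terminally wedged gpu
 * counts as "reset finished", so waiters are released instead of sleeping
 * until the timeout fires.
 */
bool sketch_wait_exit_cond(enum sketch_gpu_state s)
{
    assert(s == SKETCH_GPU_OK || s == SKETCH_GPU_RESET_IN_PROGRESS ||
           s == SKETCH_GPU_WEDGED);
    return s != SKETCH_GPU_RESET_IN_PROGRESS;
}

/*
 * Check done before grabbing the big lock. Taking the lock does not mean the
 * caller will touch the hw, so a wedged gpu is not an error here.
 * -EAGAIN is an assumption of this sketch standing in for "keep waiting".
 */
int sketch_precheck_before_lock(enum sketch_gpu_state s)
{
    return sketch_wait_exit_cond(s) ? 0 : -EAGAIN;
}

/* Only the execbuf path, which really needs the hw, reports -EIO. */
int sketch_execbuf_check(enum sketch_gpu_state s)
{
    if (s == SKETCH_GPU_WEDGED)
        return -EIO;
    if (s == SKETCH_GPU_RESET_IN_PROGRESS)
        return -EAGAIN;
    return 0;
}
```

With this model a gtt mmap fault or any non-execbuf ioctl on a wedged gpu gets
past the lock check with 0, while an execbuf on the same gpu still sees -EIO,
which matches the ABI described in the commit message.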