[Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

Wed May 1 12:38:27 CEST 2013

On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> If reset fails, the GPU is declared wedged. This ideally should never
> happen, but very rarely it does. After the GPU is declared wedged, we
> must allow userspace to continue to use its mapping of bo in order to
> recover its data (and in some cases in order for memory management to
> continue unabated). Obviously after the GPU is wedged, no bo are
> currently accessed by the GPU and so we can complete any waits or domain
> transitions away from the GPU. Currently, we fail this essential task
> and instead report EIO and send a SIGBUS to the affected process -
> causing major loss of data (by killing X or compiz).
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>

So I've read again through the reset code and I still don't see how
wait_rendering can ever give us -EIO once the gpu is dead. So all the
-EIO eating after wait_rendering looks really suspicious to me.

Now the other thing is i915_gem_object_wait_rendering, that thing
loves to throw an -EIO at us. And on a quick check your patch misses
the one in set_domain_ioctl. We probably need to do the same with
sw_finish_ioctl. So what about a i915_mutex_lock_interruptible_no_EIO
or similar to explicitly annotate the few places we don't want to hear
about a dead gpu?
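
To make the idea concrete, here is a minimal userspace mock, not driver
code. Everything prefixed fake_ is hypothetical, and it assumes the -EIO
comes from an error check done before the lock is taken; the real i915
internals may differ. It only shows the shape of a _no_EIO variant: same
path as the normal lock helper, except that a wedged gpu is not reported
to the caller.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real i915 struct. */
struct fake_gpu {
	bool wedged;    /* reset failed, gpu declared dead */
	bool lock_held; /* models struct_mutex being held */
};

/* Assumed pre-lock check: reports -EIO once the gpu is wedged. */
int fake_wait_for_error(const struct fake_gpu *gpu)
{
	return gpu->wedged ? -EIO : 0;
}

/* Common path: optionally swallow -EIO, then take the (mock) lock. */
int fake_lock_common(struct fake_gpu *gpu, bool report_eio)
{
	int ret = fake_wait_for_error(gpu);

	/* Callers that only move data away from the gpu ignore a dead gpu. */
	if (ret == -EIO && !report_eio)
		ret = 0;
	if (ret)
		return ret;

	gpu->lock_held = true;
	return 0;
}

/* Default behaviour: a wedged gpu is reported as -EIO. */
int fake_mutex_lock_interruptible(struct fake_gpu *gpu)
{
	return fake_lock_common(gpu, true);
}

/* Annotated variant for the few places that must keep working. */
int fake_mutex_lock_interruptible_no_EIO(struct fake_gpu *gpu)
{
	return fake_lock_common(gpu, false);
}
```

With wedged set, the first helper returns -EIO and leaves the lock
untaken, while the _no_EIO variant returns 0 and takes it, which is the
behaviour paths like set_domain_ioctl and sw_finish_ioctl would want.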

And if the chances of us breaking bo waiting are too high we can
always add a few crazy igts which manually wedge the gpu to test them
and ensure they all work.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch