[Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain
Chris Wilson
chris at chris-wilson.co.uk
Wed May 1 13:06:15 CEST 2013
On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> > If reset fails, the GPU is declared wedged. This ideally should never
> > happen, but very rarely it does. After the GPU is declared wedged, we
> > must allow userspace to continue to use its mapping of bo in order to
> > recover its data (and in some cases in order for memory management to
> > continue unabated). Obviously after the GPU is wedged, no bo are
> > currently accessed by the GPU and so we can complete any waits or domain
> > transitions away from the GPU. Currently, we fail this essential task
> > and instead report EIO and send a SIGBUS to the affected process -
> > causing major loss of data (by killing X or compiz).
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>
> So I've read again through the reset code and I still don't see how
> wait_rendering can ever gives us -EIO once the gpu is dead. So all the
> -EIO eating after wait_rendering looks really suspicious to me.
That's always been a layer of defense against the driver. Alternatively,
throw we can throw a warn into the wait as we shouldn't enter there with
a seqno and a wedged GPU.
> Now the other thing is i915_gem_object_wait_
> rendering, that thing loves to throw an -EIO at us. And on a quick
> check your patch misses the one in set_domain_ioctl. We probably need
> to do the same with sw_finish_ioctl. So what about a
> i915_mutex_lock_interruptible_no_EIO or similar to explictily annotate
> the few places we don't want to hear about a dead gpu?
In fact, it turns out that we hardly ever want
i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
that EIO was only ever returned when we tried to issue a command on the
GPU whilst it was terminally wedged - in order to preserve data
integrity.
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2bd8d7a..f243e32 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
/* GPU is already declared terminally dead, give up. */
if (i915_terminally_wedged(error))
- return -EIO;
+ return 0;
/*
* Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
10*HZ);
if (ret == 0) {
DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
- return -EIO;
- } else if (ret < 0) {
- return ret;
- }
+ atomic_set(&error->reset_counter, I915_WEDGED);
+ } else if (ret > 0)
+ ret = 0;
#undef EXIT_COND
- return 0;
+ return ret;
}
int i915_mutex_lock_interruptible(struct drm_device *dev)
> And if the chances of us breaking bo waiting are too high we can
> always add a few crazy igts which manually wedge the gpu to test them
> and ensure they all work.
--
Chris Wilson, Intel Open Source Technology Centre
More information about the Intel-gfx
mailing list