[Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

Daniel Vetter daniel at ffwll.ch
Wed May 1 14:40:43 CEST 2013


On Wed, May 1, 2013 at 1:06 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> > If reset fails, the GPU is declared wedged. This ideally should never
>> > happen, but very rarely it does. After the GPU is declared wedged, we
>> > must allow userspace to continue to use its mapping of bo in order to
>> > recover its data (and in some cases in order for memory management to
>> > continue unabated). Obviously after the GPU is wedged, no bo are
>> > currently accessed by the GPU and so we can complete any waits or domain
>> > transitions away from the GPU. Currently, we fail this essential task
>> > and instead report EIO and send a SIGBUS to the affected process -
>> > causing major loss of data (by killing X or compiz).
>> >
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>
>> So I've read through the reset code again and I still don't see how
>> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> -EIO eating after wait_rendering looks really suspicious to me.
>
> That's always been a layer of defense against the driver. Alternatively,
> we can throw a warn into the wait as we shouldn't enter there with
> a seqno and a wedged GPU.

Yeah, that sounds like a good idea. We really shouldn't ever have an
outstanding request while the gpu is wedged (ignoring dangerous changes
to the wedged state through debugfs).
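
Roughly something like this in the wait path, I'd guess (just a sketch,
the exact spot and the final form of the check would need looking at
against the current code):

        /* Sketch: once the GPU is terminally wedged all outstanding
         * requests should already have been completed, so reaching a
         * wait with a live seqno is a driver bug - shout about it
         * instead of silently eating it as -EIO.
         */
        if (WARN_ON(i915_terminally_wedged(&dev_priv->gpu_error)))
                return -EIO;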

>> Now the other thing is i915_mutex_lock_interruptible, that thing loves
>> to throw an -EIO at us. And on a quick check your patch misses the one
>> in set_domain_ioctl. We probably need to do the same with
>> sw_finish_ioctl. So what about an i915_mutex_lock_interruptible_no_EIO
>> or similar to explicitly annotate the few places we don't want to hear
>> about a dead gpu?
>
> In fact, it turns out that we hardly ever want
> i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> that EIO was only ever returned when we tried to issue a command on the
> GPU whilst it was terminally wedged - in order to preserve data
> integrity.

Don't we need to re-add a check in execbuf then? Otherwise I guess
userspace never gets any indication that something is amiss ...
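
I'm thinking of something as simple as this near the top of execbuffer
(untested sketch only, exact placement tbd):

        /* Sketch: with the GPU terminally wedged we can't execute
         * anything, so reject new batches with -EIO here rather than in
         * the low-level wait/domain paths. Reads and cpu mmaps keep
         * working, so userspace can still rescue its data.
         */
        if (i915_terminally_wedged(&dev_priv->gpu_error))
                return -EIO;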

Otherwise I think this approach would also make sense, and feels more
future-proof. Applications that want real robustness simply need to
query the kernel through the new interface whether anything ugly has
happened to them. And if we want more synchronous feedback we can
always improve wait/set_domain_ioctl to check whether the given bo
suffered through a little calamity.

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2bd8d7a..f243e32 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>
>         /* GPU is already declared terminally dead, give up. */
>         if (i915_terminally_wedged(error))
> -               return -EIO;
> +               return 0;
>
>         /*
>          * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>                                                10*HZ);
>         if (ret == 0) {
>                 DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> -               return -EIO;
> -       } else if (ret < 0) {
> -               return ret;
> -       }
> +               atomic_set(&error->reset_counter, I915_WEDGED);
> +       } else if (ret > 0)
> +               ret = 0;

I'm less convinced about this hunk. I think the right approach would
be to simply kill the timeout. Unlucky users can always get out of
here with a signal, and it's imo more robust if we have all relevant
timeouts for the reset process in the reset work. Separate patch
though. Also we now have quite good coverage of all the reset handling
in igt, so I don't think we need to go overboard in defending against
our own logic screw-ups ... Reset really should work nowadays.
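
I.e. something along these lines instead of the 10s timeout plus the
wedging fallback (assuming the reset_queue/i915_reset_in_progress
plumbing stays as-is; very much a sketch for that separate patch):

        /* Sketch: wait interruptibly, without an arbitrary timeout, for
         * any reset in progress to finish; the reset work itself is
         * responsible for declaring the GPU wedged if recovery fails.
         */
        ret = wait_event_interruptible(error->reset_queue,
                                       !i915_reset_in_progress(error));
        if (ret)
                return ret;

        return 0;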

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


