[Intel-gfx] [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip

Wed Jun 17 08:07:31 PDT 2015

On Wed, Jun 17, 2015 at 03:16:10PM +0100, Chris Wilson wrote:
> We have gone far off topic.
> 
> The question is how we want __i915_wait_request() to handle a wedged
> GPU.
> 
> It currently reports EIO, and my argument is that is wrong wrt to the
> semantics of the wait completion and that no caller actually cares about
> EIO from __i915_wait_request().
> 
> * Correction: one caller cares!
> 
> If we regard a wedged GPU (and in the short term a reset is equally
> terminal to an outstanding request) then the GPU can no longer be
> accesing thta request and the wait can be safely completed. Imo it is
> correct to return 0 in all circumstances. (Reset pending needs to return
> -EAGAIN if we need to backoff, but for the lockless consumers we can
> just ignore the reset notification.
> 
> That is set-domain, mmioflip, modesetting do not care if the request
> succeeded, just that it completed.
> 
> Throttle() has an -EIO in its ABI for reporting a wedged GPU - this is
> used by X to detect when the GPU is unusable prior to use, e.g. when
> waking up, and also during its periodic flushes.
> 
> Overlay reports -EIO when turning on and hanging the GPU. To be fair, it
> can equally report that failure the very next time it touches the ring.
> 
> Execbuf itself doesn't rely on wait request reporting EIO, just that we
> report EIO prior to submitting work o a dead GPU/context. Execbuf uses
> wait_request via two paths, syncing to an old request on another ring
> and for flushing requests from the ringbuffer to make room for new
> commands. This is the tricky part, the only instance where we rely on
> aborting after waiting but before further operation - we won't even
> notice a dead GPU prior to starting a request and running out of space
> otherwise). Since it is the only instance, we can move the terminal
> detection of a dead GPU from the wait request into the
> ring_wait_for_space(). This is in keeping with the ethos that we do not
> report -EIO until we attempt to access the GPU.

Ok, following up with my side of the irc discussion we've had. I agree
that there's only 2 places where we must report an EIO if the gpu is
terminally wedge:
- throttle
- execbuf

How that's done doesn't matter, and when it's racy wrt concurrent gpu
deaths that also doens't matter, i.e. we don't need wait_request to EIO
immediately as long as we check terminally_wedged somewhere in these
ioctls.

My main concern is that if we remove the EIO from wait_request we'll
accidentally also remove the EIO from execbuf. And we've had kernels where
the only EIO left was the wait_request from ring_begin ...

But if we add a small igt to manually wedge the gpu through debugfs and
then check that throttle/execbuf do EIO that risk is averted and I'd be ok
with eating EIO from wait_request with extreme prejudice. Since indeed we
still have trouble with EIO at least temporarily totally wreaking modeset
ioclts and other things that really always should work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch