[Intel-gfx] [PATCH] drm/i915: Ignore -EIO from __i915_wait_request() during mmio flip

Wed Jun 17 06:05:50 PDT 2015

On Wed, Jun 17, 2015 at 01:53:55PM +0200, Daniel Vetter wrote:
> On Tue, Jun 16, 2015 at 05:30:19PM +0100, Chris Wilson wrote:
> > On Tue, Jun 16, 2015 at 06:21:53PM +0200, Daniel Vetter wrote:
> > > On Tue, Jun 16, 2015 at 01:10:33PM +0100, Chris Wilson wrote:
> > > > On Mon, Jun 15, 2015 at 06:34:51PM +0200, Daniel Vetter wrote:
> > > > > On Thu, Jun 11, 2015 at 09:01:08PM +0100, Chris Wilson wrote:
> > > > > > On Thu, Jun 11, 2015 at 07:14:28PM +0300, ville.syrjala at linux.intel.com wrote:
> > > > > > > From: Ville Syrjälä <ville.syrjala at linux.intel.com>
> > > > > > > 
> > > > > > > When the GPU gets reset __i915_wait_request() returns -EIO to the
> > > > > > > mmio flip worker. Currently we WARN whenever we get anything other
> > > > > > > than 0. Ignore the -EIO too since it's a perfectly normal thing
> > > > > > > to get during a GPU reset.
> > > > > > 
> > > > > > > Nak. I consider it a bug in __i915_wait_request(). I am discussing
> > > > > > with Thomas Elf how to fix this wrt the next generation of individual
> > > > > > ring resets.
> > > > > 
> > > > > We should only get an -EIO if the gpu is truly gone, but an -EAGAIN when
> > > > > the reset is ongoing. Neither is currently handled. For lockless users we
> > > > > probably want a version of wait_request which just dtrt (of waiting for
> > > > > the reset handler to complete without trying to grab the mutex and then
> > > > > returning). Or some other means of retrying.
> > > > > 
> > > > > Returning -EIO from the low-level wait function still seems appropriate,
> > > > > but callers need to eat/handle it appropriately. WARN_ON isn't it here
> > > > > ofc.
> > > > 
> > > > Bleh, a few years ago you decided not to take the EIO handling along the
> > > > call paths that don't care.
> > > > 
> > > > I disagree. There are two classes of callers, those that care about
> > > > EIO/EAGAIN and those that simply want to know when the GPU is no longer
> > > > processing that request. That latter class is still popping up in
> > > > bugzilla with frozen displays. For the former, we actually only care
> > > > about backoff if we are holding the mutex - and that is only required
> > > > for EAGAIN. The only user that cares about EIO is throttle().
> > > 
> > > > Hm, right now the design is that for non-interruptible waits we indeed
> > > return -EIO or -EAGAIN, but the reset handler will fix up outstanding
> > > flips. So I guess removing the WARN_ON here is indeed the right thing to
> > > do. We should probably change this once we have atomic (where the wait
> > > doesn't need a lock really, at least for async commits which is what
> > > matters here) and loop until completion.
> > > 
> > > > I'm still wary of eating -EIO in general since it's so hard to test all
> > > this for correctness. Maybe we need a __check_wedge which can return -EIO
> > > and a check_wedge which eats it. And then decide once for where to put
> > > special checks, probably just execbuf and throttle.
> > 
> > Even execbuf really doesn't care. If the GPU didn't complete the earlier
> > request (principally for semaphore sw sync), it makes no difference for
> > us now. The content is either corrupt, or we bail when we spot the
> > wedged GPU upon writing to the ring. Reporting EIO because of an earlier
> > failure is a poor substitute for the async reset notification. But here
> > we still need EAGAIN backoff ofc.
> > 
> > I really think eating EIO is the right thing to do in most circumstances
> > and is correct with the semantics of the callers.
> 
> > Well we once had the transparent sw fallback at least in the ddx for -EIO.
> > Mesa never coped for obvious reasons, and given that a modern desktop
> > can't survive without GL there's not all that much point any more. But
> > still I think if the gpu is terminally dead we need to tell this to
> > userspace somehow.

The DDX checks throttle() for that purpose. Error returns from
execbuffer usually indicate that the kernel is broken and we promptly
ignore them. Having execbuf report EIO is superfluous since it is an async
error from before.

> What I'm unclear about is which ioctl that should be, and my assumption
> thus has been that it's execbuf.

Nope. It's throttle.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
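
A minimal sketch of the __check_wedge / check_wedge split floated earlier
in the thread, written as standalone C rather than driver code. Everything
here is an assumption made for illustration: struct gpu_state, its two
flags, and both function names are hypothetical and do not match the real
i915 structures. raw_check_wedge() plays the role of the proposed
__check_wedge (it reports -EIO), and check_wedge() is the variant that eats
-EIO for callers that only want to know the GPU has stopped processing a
request. The -EAGAIN/-EIO mapping follows the rule stated in the thread
(-EAGAIN while a reset is ongoing, -EIO when the GPU is truly gone).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's GPU error state. */
struct gpu_state {
	bool reset_in_progress;	/* a reset is currently being handled */
	bool wedged;		/* the GPU is terminally dead */
};

/*
 * Raw check (the role of the proposed __check_wedge):
 * -EAGAIN while a reset is ongoing, -EIO if the GPU is gone, else 0.
 */
static int raw_check_wedge(const struct gpu_state *gpu)
{
	if (gpu->reset_in_progress)
		return -EAGAIN;
	if (gpu->wedged)
		return -EIO;
	return 0;
}

/*
 * Variant for callers that do not care about -EIO: a dead GPU will
 * never touch the request again, so it is reported as success.
 * -EAGAIN is still passed through so the caller can back off and retry.
 */
static int check_wedge(const struct gpu_state *gpu)
{
	int ret = raw_check_wedge(gpu);

	return ret == -EIO ? 0 : ret;
}
```

Under this split, only the few callers that must report a dead GPU to
userspace (throttle, per the discussion above) would use the raw variant;
a lockless caller such as the mmio flip worker would use check_wedge() and
never see -EIO.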