[Intel-gfx] [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl

Tue Dec 1 01:04:23 PST 2015

On Tue, Dec 01, 2015 at 09:28:08AM +0100, Daniel Vetter wrote:
> On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> > On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > > So there's 3 competing proposals for what wait_ioctl should do wrt
> > > -EIO:
> > > 
> > > - return -EIO when the gpu is wedged. Not terribly useful for
> > >   userspace since it might race with a hang and then there's no
> > >   guarantee that a subsequent execbuf won't end up in an -EIO.
> > >   Terminally wedge really can only be reliably signalled at execbuf
> > >   time, and userspace needs to cope with that (or decide not to
> > >   bother).
> > > 
> > > - EIO for any obj that suffered from a reset. This means big internal
> > >   reorginazation in the kernel since currently we track reset stats
> > >   per-ctx and not on the obj. That's also what arb robustness wants.
> > >   We could do this, but this feels like new ABI territory with the
> > >   usual userspace requirements and high hurdles.
> > > 
> > > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> > >   implement. Which is what this patch does.
> > 
> > Since no one else is weighing into the ABI discussion, I'm happy with
> > losing EIO here. I thought it could be useful, but as no one is using or
> > seems likely to start using it, begone.
> > 
> > > We can always opt to change this later on if there's a real need.
> > > 
> > > To make the test really exercise this do a full wedged gpu hang, to
> > > make sure -EIO doesn't leak out at all.
> > > 
> > > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > > Signed-off-by: Daniel Vetter <daniel.vetter at intel.com>
> > > ---
> > >  tests/gem_eio.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > > index a24c8f1c53b5..8345d1a7a429 100644
> > > --- a/tests/gem_eio.c
> > > +++ b/tests/gem_eio.c
> > > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> > >  {
> > >  	igt_hang_ring_t hang;
> > >  
> > > +	igt_require(i915_reset_control(false));
> > 
> > However, this is not required to test the ABI change above as the wait
> > itself will still hang, whether or not it wedges the GPU.
> 
> Yes it's not strictly required, but without it the testcase is fairly
> boring. If we move the check_wedge out of wait_request then a normail gpu
> reset would always return 0 (after retrying a few times perhaps), so I
> figured testing the wedged case is the only one that's worth it.

But wedging during the hang is also not interesting as we have no
opportunity to see the reset failure in the test case. Putting the GPU
into the wedged state before the wait, should be a trivial test that the
object is idle after the reset.

> Maybe we should dupe the subtests all and have wedged and non-wedged cases
> for all of them? That would imo make more sense.

The others, what matters is how we handle the GPU being wedged before we
queue an execbuf or throttling. In terms of testing no error is reported
for the hanging case, we should add tests for set-domain (so that it is
explicit and not reliant on implementation inside lib/), GTT faulting
(that would need both wedged and hanging cases, but we are not likely to
be able to cover all the cases like waiting on fence and other secondary
waits) and throttle.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre