[Intel-gfx] [PATCH 1/2] tests/gem_eio: New ABI - no EIO even from wait_ioctl

Thu Dec 3 01:00:11 PST 2015

On Thu, Dec 03, 2015 at 09:50:25AM +0100, Daniel Vetter wrote:
> On Tue, Dec 01, 2015 at 09:20:02AM +0000, Chris Wilson wrote:
> > On Tue, Dec 01, 2015 at 10:13:10AM +0100, Daniel Vetter wrote:
> > > On Tue, Dec 01, 2015 at 09:04:23AM +0000, Chris Wilson wrote:
> > > > On Tue, Dec 01, 2015 at 09:28:08AM +0100, Daniel Vetter wrote:
> > > > > On Mon, Nov 30, 2015 at 10:11:12AM +0000, Chris Wilson wrote:
> > > > > > On Thu, Nov 26, 2015 at 12:34:34PM +0100, Daniel Vetter wrote:
> > > > > > > So there's 3 competing proposals for what wait_ioctl should do wrt
> > > > > > > -EIO:
> > > > > > > 
> > > > > > > - return -EIO when the gpu is wedged. Not terribly useful for
> > > > > > >   userspace since it might race with a hang and then there's no
> > > > > > >   guarantee that a subsequent execbuf won't end up in an -EIO.
> > > > > > >   Terminally wedge really can only be reliably signalled at execbuf
> > > > > > >   time, and userspace needs to cope with that (or decide not to
> > > > > > >   bother).
> > > > > > > 
> > > > > > > - EIO for any obj that suffered from a reset. This means big internal
> > > > > > >   reorginazation in the kernel since currently we track reset stats
> > > > > > >   per-ctx and not on the obj. That's also what arb robustness wants.
> > > > > > >   We could do this, but this feels like new ABI territory with the
> > > > > > >   usual userspace requirements and high hurdles.
> > > > > > > 
> > > > > > > - No -EIO at all. Consistent with set_domain_ioctl and simplest to
> > > > > > >   implement. Which is what this patch does.
> > > > > > 
> > > > > > Since no one else is weighing into the ABI discussion, I'm happy with
> > > > > > losing EIO here. I thought it could be useful, but as no one is using or
> > > > > > seems likely to start using it, begone.
> > > > > > 
> > > > > > > We can always opt to change this later on if there's a real need.
> > > > > > > 
> > > > > > > To make the test really exercise this do a full wedged gpu hang, to
> > > > > > > make sure -EIO doesn't leak out at all.
> > > > > > > 
> > > > > > > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > > > > > > Signed-off-by: Daniel Vetter <daniel.vetter at intel.com>
> > > > > > > ---
> > > > > > >  tests/gem_eio.c | 6 +++++-
> > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > > > > > > index a24c8f1c53b5..8345d1a7a429 100644
> > > > > > > --- a/tests/gem_eio.c
> > > > > > > +++ b/tests/gem_eio.c
> > > > > > > @@ -161,10 +161,14 @@ static void test_wait(int fd)
> > > > > > >  {
> > > > > > >  	igt_hang_ring_t hang;
> > > > > > >  
> > > > > > > +	igt_require(i915_reset_control(false));
> > > > > > 
> > > > > > However, this is not required to test the ABI change above as the wait
> > > > > > itself will still hang, whether or not it wedges the GPU.
> > > > > 
> > > > > Yes it's not strictly required, but without it the testcase is fairly
> > > > > boring. If we move the check_wedge out of wait_request then a normail gpu
> > > > > reset would always return 0 (after retrying a few times perhaps), so I
> > > > > figured testing the wedged case is the only one that's worth it.
> > > > 
> > > > But wedging during the hang is also not interesting as we have no
> > > > opportunity to see the reset failure in the test case. Putting the GPU
> > > > into the wedged state before the wait, should be a trivial test that the
> > > > object is idle after the reset.
> > > 
> > > Right now (with current kernels) we see an -EIO with this testcase instead
> > > of 0 in the wait. Without disabling reset we see 0 both on fixed and
> > > broken kernels. So I don't really see why not testing this case is a good
> > > idea? It's the one we're currently failing at and leak -EIO to userspace.
> > 
> > Because that wasn't the ABI I was trying to test here and expected to
> > maintain! I was just aiming for covering the ioctl calls with expected
> > EIO (to do the opposite would need something like trinity or a guided
> > fuzzer).
> 
> Well yeah, but I thought we decided to change it? So Ack or not?

If you change the subtest to do both, the more likely case where there
is a GPU hang whilst we wait for a request, and the second where the GPU
hang leads to a failed reset. There's actually a another pair where the
GPU hangs (and reset succeeds or not) before we call the wait_ioctl. And
then we can use the guts of that function to do the set-domain and
gtt-fault testing.

As far as wait-ioctl reporting 0 if the request is completed due to a
reset, that's an ACK.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre