[Intel-gfx] [igt-dev] [PATCH i-g-t] igt/gem_eio: Measure reset delay from thread

Fri Aug 3 14:10:41 UTC 2018

Quoting Tvrtko Ursulin (2018-08-03 15:03:05)
> 
> On 27/07/2018 12:13, Chris Wilson wrote:
> > We assert that we complete a wedge within 250ms. However, when we use a
> > thread to delay the wedging until after we start waiting, that thread
> > itself is delayed longer than our wait timeout. This results in a false
> > positive error where we fail the test before we even trigger the reset.
> > 
> > Reorder the test so that we only ever measure the delay from triggering
> > the reset until we wakeup, and assert that is in a timely fashion
> > (less than 250ms).
> > 
> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105954
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > ---
> >   tests/gem_eio.c | 45 ++++++++++++++++++++++++---------------------
> >   1 file changed, 24 insertions(+), 21 deletions(-)
> > 
> > diff --git a/tests/gem_eio.c b/tests/gem_eio.c
> > index 3162a3170..b26b4b895 100644
> > --- a/tests/gem_eio.c
> > +++ b/tests/gem_eio.c
> > @@ -190,7 +190,8 @@ static igt_spin_t * spin_sync(int fd, uint32_t ctx, unsigned long flags)
> >   
> >   struct hang_ctx {
> >       int debugfs;
> > -     struct timespec ts;
> > +     struct timespec delay;
> > +     struct timespec *ts;
> >       timer_t timer;
> >   };
> >   
> > @@ -198,8 +199,10 @@ static void hang_handler(union sigval arg)
> >   {
> >       struct hang_ctx *ctx = arg.sival_ptr;
> >   
> > -     igt_debug("hang delay = %.2fus\n", igt_nsec_elapsed(&ctx->ts) / 1000.0);
> > +     igt_debug("hang delay = %.2fus\n",
> > +               igt_nsec_elapsed(&ctx->delay) / 1000.0);
> >   
> > +     igt_nsec_elapsed(ctx->ts);
> >       igt_assert(igt_sysfs_set(ctx->debugfs, "i915_wedged", "-1"));
> >   
> >       igt_assert_eq(timer_delete(ctx->timer), 0);
> > @@ -207,7 +210,7 @@ static void hang_handler(union sigval arg)
> >       free(ctx);
> >   }
> >   
> > -static void hang_after(int fd, unsigned int us)
> > +static void hang_after(int fd, unsigned int us, struct timespec *ts)
> >   {
> >       struct sigevent sev = {
> >               .sigev_notify = SIGEV_THREAD,
> > @@ -229,30 +232,30 @@ static void hang_after(int fd, unsigned int us)
> >   
> >       igt_assert_eq(timer_create(CLOCK_MONOTONIC, &sev, &ctx->timer), 0);
> >   
> > -     igt_nsec_elapsed(&ctx->ts);
> > +     ctx->ts = ts;
> > +     igt_nsec_elapsed(&ctx->delay);
> >   
> >       igt_assert_eq(timer_settime(ctx->timer, 0, &its, NULL), 0);
> >   }
> >   
> > -static int __check_wait(int fd, uint32_t bo, unsigned int wait)
> > +static void __check_wait(int fd, uint32_t bo, unsigned int wait)
> >   {
> > -     unsigned long wait_timeout = 250e6; /* Some margin for actual reset. */
> > -     int ret;
> > +     struct timespec ts = {};
> > +     uint64_t elapsed;
> >   
> >       if (wait) {
> > -             /*
> > -              * Double the wait plus some fixed margin to ensure gem_wait
> > -              * can never time out before the async hang runs.
> > -              */
> > -             wait_timeout += wait * 2000 + 250e6;
> > -             hang_after(fd, wait);
> > +             hang_after(fd, wait, &ts);
> >       } else {
> > +             igt_nsec_elapsed(&ts);
> >               manual_hang(fd);
> >       }
> >   
> > -     ret = __gem_wait(fd, bo, wait_timeout);
> > +     gem_sync(fd, bo);
> 
> One drawback I can see is if reset fails to work then we hang until IGT 
> runner timeout. Alternative would be a smaller patch to bump the wait 
> delay which should work given how rare are the failures. It was one of 
> the goals of the gem_eio rewrite I think, but whatever, I suppose 
> runtime of test when reset is broken is not that important.

It hard to put an upper bound on just how much the scheduler is going to
screw us over. On the other hand, this let's us say precisely which
portion must be done in the timely fashion (reset -> wakeup), and for
that bound can be invariant. Which I think is nice.

The downside is indeed that instead of getting a clear fail with
explanation if the reset fails completely (not even able to wedge the
device), and that hangcheck itself fails to wedge eventually, then we do
rely on the external runner to declare the test broken. I think that's
fair, it requires a series of events to fail; and in particular
hangcheck is supposed to be guarding against even its own failure.
-Chris