[igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: Initial set of tests for GPU watchdog

Chris Wilson chris at chris-wilson.co.uk
Tue Oct 9 08:27:52 UTC 2018


Quoting Antonio Argenziano (2018-10-08 22:43:39)
> 
> 
> On 05/10/18 18:05, Carlos Santa wrote:
> > This test adds basic set of tests to reset the different
> > GPU engines through the watchdog timer.
> > 
> > Credits to Antonio for the original codebase this is based on.
> > 
> > This was verified on SKL/ULT GT3:
> > 
> > $./gem_watchdog --run-subtest basic-vecs0
> > IGT-Version: 1.23-gaaeb2007206d (x86_64) (Linux: 4.18.0-rc7+ x86_64)
> > Starting subtest: basic-vecs0
> > Subtest basic-vecs0: SUCCESS (2.402s)
> > $ sudo cat /sys/kernel/debug/dri/0/i915_reset_info
> > full gpu reset = 0
> > GuC watchdog/media reset = 0
> > rcs0 = 0
> > bcs0 = 0
> > vcs0 = 0
> > vcs1 = 0
> > vecs0 = 1
> > 
> > v2: (Review comments from Chris Wilson)
> >   * Replace send_canary() by timestamps before/after the hang
> >     and measure dt. Use dt < 2*threshold + reset + submission
> >     to check watchdog vs hangcheck
> >   * Initialize drm_i915_gem_context_param args only once at
> >     the struct declaration
> >   * Avoid using MAX_ENGINES implicitly to declare engines_thresholds
> >     array
> >   * Remove unnecessary igt_assert(!check_error_state(fd))
> >   * Use the class:instance interface when looping through the engines
> > 
> >    (Review by Petri Latvala)
> >   * Update the correct patch's year timestamp
> >   * Include IGT_DESCRIPTION() label
> > 
> > Cc: Antonio Argenziano <antonio.argenziano at intel.com>
> > Cc: Michel Thierry <michel.thierry at intel.com>
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Signed-off-by: Carlos Santa <carlos.santa at intel.com>
> > ---
> >   tests/Makefile.sources |   1 +
> >   tests/gem_watchdog.c   | 186 +++++++++++++++++++++++++++++++++++++++++++++++++
> >   tests/meson.build      |   1 +
> >   3 files changed, 188 insertions(+)
> >   create mode 100644 tests/gem_watchdog.c
> > 
> 
> 
> > +
> > +static void media_hang_simple(int fd, const struct intel_execution_engine2 *e)
> > +{
> > +     uint32_t ctx;
> > +     unsigned flags = HANG_ALLOW_CAPTURE;
> > +     struct timeval start, end;
> > +     double dt_msecs;
> > +
> > +     /* Submit on default context */
> > +     ctx = 0;
> > +     context_set_watchdog(fd, e, ctx, WATCHDOG_THRESHOLD);
> > +
> > +     clear_error_state(fd);
> > +
> > +     gettimeofday(&start, NULL);
> > +     inject_hang(fd, ctx, e, flags);
> > +     gettimeofday(&end, NULL);
> > +     dt_msecs = elapsed(&start, &end)/1000;
> > +
> > +     /* reset time for watchdog should be less than 2*threshold + engine reset time + submission */
> > +     igt_assert(dt_msecs < 2*WATCHDOG_THRESHOLD + 15);
> > +
> > +     /* Assert if error state is not clean */
> > +     igt_assert(!check_error_state(fd));
> 
> If you have a small enough threshold you don't need the dmesg check for 
> resets. IMO, ideally there would be a way to have the hang_detector 
> running but that doesn't work because it would disable watchdog as well.

There will be no be capture. The stop_machine() employed there is the
antithesis of light and fast reset. There will be no uevent since that
is the global reset, so hang detector will not work. The way to detect
the failure is to check the fence status which will report the reset of
that particular request. inject_hang() needs to be reworked and
gettimeofday() can be replaced by any of the igt timers.
context_set_watchdog uapi is very wrong. The assert that dt is less than
2*threshold is intriguing since the kernel patches should require 2
evaluation intervals to detect a stuck seqno, and the arbitrary epsilon
will be upset by the slightest sneeze, let alone scheduler misbehaviour
and kasan.

> > +}
> > +
> > +igt_main
> > +{
> > +     int fd = -1;
> > +
> > +     igt_skip_on_simulation();

This is definitely suitable for sim as we are testing control of hw
features (which should help underline how far off dt asserts will be).

> > +     igt_fixture {
> > +             fd = drm_open_driver(DRIVER_INTEL);
> > +             igt_require_gem(fd);
> > +     }
> > +
> > +     for (const struct intel_execution_engine2 *e = intel_execution_engines2; e->name; e++) {
> > +             igt_subtest_group {
> > +                     igt_fixture {
> > +                             gem_require_engine(fd, e->class, e->instance);
> 
> Move the require inside the subtest to maintain a consistent test list 
> across platforms.
> 
> Thanks,
> Antonio
> 
> > +                     }
> > +
> > +                     /* default exec-id is purely symbolic */

And this is purely garbage.
-Chris


More information about the igt-dev mailing list