[igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: Initial set of tests for GPU watchdog

Antonio Argenziano antonio.argenziano at intel.com
Tue Oct 9 16:12:46 UTC 2018



On 09/10/18 01:27, Chris Wilson wrote:
> Quoting Antonio Argenziano (2018-10-08 22:43:39)
>>
>>
>> On 05/10/18 18:05, Carlos Santa wrote:
>>> This test adds basic set of tests to reset the different
>>> GPU engines through the watchdog timer.
>>>
>>> Credits to Antonio for the original codebase this is based on.
>>>
>>> This was verified on SKL/ULT GT3:
>>>
>>> $./gem_watchdog --run-subtest basic-vecs0
>>> IGT-Version: 1.23-gaaeb2007206d (x86_64) (Linux: 4.18.0-rc7+ x86_64)
>>> Starting subtest: basic-vecs0
>>> Subtest basic-vecs0: SUCCESS (2.402s)
>>> $ sudo cat /sys/kernel/debug/dri/0/i915_reset_info
>>> full gpu reset = 0
>>> GuC watchdog/media reset = 0
>>> rcs0 = 0
>>> bcs0 = 0
>>> vcs0 = 0
>>> vcs1 = 0
>>> vecs0 = 1
>>>
>>> v2: (Review comments from Chris Wilson)
>>>    * Replace send_canary() by timestamps before/after the hang
>>>      and measure dt. Use dt < 2*threshold + reset + submission
>>>      to check watchdog vs hangcheck
>>>    * Initialize drm_i915_gem_context_param args only once at
>>>      the struct declaration
>>>    * Avoid using MAX_ENGINES implicitly to declare engines_thresholds
>>>      array
>>>    * Remove unnecessary igt_assert(!check_error_state(fd))
>>>    * Use the class:instance interface when looping through the engines
>>>
>>>     (Review by Petri Latvala)
>>>    * Update the correct patch's year timestamp
>>>    * Include IGT_DESCRIPTION() label
>>>
>>> Cc: Antonio Argenziano <antonio.argenziano at intel.com>
>>> Cc: Michel Thierry <michel.thierry at intel.com>
>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>> Signed-off-by: Carlos Santa <carlos.santa at intel.com>
>>> ---
>>>    tests/Makefile.sources |   1 +
>>>    tests/gem_watchdog.c   | 186 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>    tests/meson.build      |   1 +
>>>    3 files changed, 188 insertions(+)
>>>    create mode 100644 tests/gem_watchdog.c
>>>
>>
>>
>>> +
>>> +static void media_hang_simple(int fd, const struct intel_execution_engine2 *e)
>>> +{
>>> +     uint32_t ctx;
>>> +     unsigned flags = HANG_ALLOW_CAPTURE;
>>> +     struct timeval start, end;
>>> +     double dt_msecs;
>>> +
>>> +     /* Submit on default context */
>>> +     ctx = 0;
>>> +     context_set_watchdog(fd, e, ctx, WATCHDOG_THRESHOLD);
>>> +
>>> +     clear_error_state(fd);
>>> +
>>> +     gettimeofday(&start, NULL);
>>> +     inject_hang(fd, ctx, e, flags);
>>> +     gettimeofday(&end, NULL);
>>> +     dt_msecs = elapsed(&start, &end)/1000;
>>> +
>>> +     /* reset time for watchdog should be less than 2*threshold + engine reset time + submission */
>>> +     igt_assert(dt_msecs < 2*WATCHDOG_THRESHOLD + 15);
>>> +
>>> +     /* Assert if error state is not clean */
>>> +     igt_assert(!check_error_state(fd));
>>
>> If you have a small enough threshold you don't need the dmesg check for
>> resets. IMO, ideally there would be a way to have the hang_detector
>> running but that doesn't work because it would disable watchdog as well.
> 
> There will be no be capture. The stop_machine() employed there is the
> antithesis of light and fast reset. There will be no uevent since that
> is the global reset, so hang detector will not work. The way to detect
> the failure is to check the fence status which will report the reset of
> that particular request. inject_hang() needs to be reworked and
> gettimeofday() can be replaced by any of the igt timers.

What I was trying to say is that if we had the interface to say have 
watchdog and global reset we could use the hang detector to fail the 
test but, I guess the fence would make things simpler.

> context_set_watchdog uapi is very wrong. The assert that dt is less than
> 2*threshold is intriguing since the kernel patches should require 2
> evaluation intervals to detect a stuck seqno, and the arbitrary epsilon
> will be upset by the slightest sneeze, let alone scheduler misbehaviour
> and kasan.

Would making the process realtime solve the problem? I think you 
originally were suggesting to use 3 batches with the hang sandwiched 
between timestamps. I wonder if we could do it with only one batch like:

TIMESTAMP
TIMESTAMP <--+
LOOP --------+

But I am not sure if the second timestamp might get corrupted in the reset.

> 
>>> +}
>>> +
>>> +igt_main
>>> +{
>>> +     int fd = -1;
>>> +
>>> +     igt_skip_on_simulation();
> 
> This is definitely suitable for sim as we are testing control of hw
> features (which should help underline how far off dt asserts will be).

I think an allow_hang() and a feature based igt_require() in the fixture 
below are needed.

> 
>>> +     igt_fixture {
>>> +             fd = drm_open_driver(DRIVER_INTEL);
>>> +             igt_require_gem(fd);
>>> +     }
>>> +
>>> +     for (const struct intel_execution_engine2 *e = intel_execution_engines2; e->name; e++) {
>>> +             igt_subtest_group {
>>> +                     igt_fixture {
>>> +                             gem_require_engine(fd, e->class, e->instance);
>>
>> Move the require inside the subtest to maintain a consistent test list
>> across platforms.
>>
>> Thanks,
>> Antonio
>>
>>> +                     }
>>> +
>>> +                     /* default exec-id is purely symbolic */
> 
> And this is purely garbage.

I think that was me, a long time ago...

Antonio

> -Chris
> 


More information about the igt-dev mailing list