[igt-dev] [PATCH i-g-t v2] tests/gem_watchdog: Initial set of tests for GPU watchdog
Antonio Argenziano
antonio.argenziano at intel.com
Tue Oct 9 16:12:46 UTC 2018
On 09/10/18 01:27, Chris Wilson wrote:
> Quoting Antonio Argenziano (2018-10-08 22:43:39)
>>
>>
>> On 05/10/18 18:05, Carlos Santa wrote:
>>> This test adds basic set of tests to reset the different
>>> GPU engines through the watchdog timer.
>>>
>>> Credits to Antonio for the original codebase this is based on.
>>>
>>> This was verified on SKL/ULT GT3:
>>>
>>> $./gem_watchdog --run-subtest basic-vecs0
>>> IGT-Version: 1.23-gaaeb2007206d (x86_64) (Linux: 4.18.0-rc7+ x86_64)
>>> Starting subtest: basic-vecs0
>>> Subtest basic-vecs0: SUCCESS (2.402s)
>>> $ sudo cat /sys/kernel/debug/dri/0/i915_reset_info
>>> full gpu reset = 0
>>> GuC watchdog/media reset = 0
>>> rcs0 = 0
>>> bcs0 = 0
>>> vcs0 = 0
>>> vcs1 = 0
>>> vecs0 = 1
>>>
>>> v2: (Review comments from Chris Wilson)
>>> * Replace send_canary() by timestamps before/after the hang
>>> and measure dt. Use dt < 2*threshold + reset + submission
>>> to check watchdog vs hangcheck
>>> * Initialize drm_i915_gem_context_param args only once at
>>> the struct declaration
>>> * Avoid using MAX_ENGINES implicitly to declare engines_thresholds
>>> array
>>> * Remove unnecessary igt_assert(!check_error_state(fd))
>>> * Use the class:instance interface when looping through the engines
>>>
>>> (Review by Petri Latvala)
>>> * Update the correct patch's year timestamp
>>> * Include IGT_DESCRIPTION() label
>>>
>>> Cc: Antonio Argenziano <antonio.argenziano at intel.com>
>>> Cc: Michel Thierry <michel.thierry at intel.com>
>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>> Signed-off-by: Carlos Santa <carlos.santa at intel.com>
>>> ---
>>> tests/Makefile.sources | 1 +
>>> tests/gem_watchdog.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> tests/meson.build | 1 +
>>> 3 files changed, 188 insertions(+)
>>> create mode 100644 tests/gem_watchdog.c
>>>
>>
>>
>>> +
>>> +static void media_hang_simple(int fd, const struct intel_execution_engine2 *e)
>>> +{
>>> + uint32_t ctx;
>>> + unsigned flags = HANG_ALLOW_CAPTURE;
>>> + struct timeval start, end;
>>> + double dt_msecs;
>>> +
>>> + /* Submit on default context */
>>> + ctx = 0;
>>> + context_set_watchdog(fd, e, ctx, WATCHDOG_THRESHOLD);
>>> +
>>> + clear_error_state(fd);
>>> +
>>> + gettimeofday(&start, NULL);
>>> + inject_hang(fd, ctx, e, flags);
>>> + gettimeofday(&end, NULL);
>>> + dt_msecs = elapsed(&start, &end)/1000;
>>> +
>>> + /* reset time for watchdog should be less than 2*threshold + engine reset time + submission */
>>> + igt_assert(dt_msecs < 2*WATCHDOG_THRESHOLD + 15);
>>> +
>>> + /* Assert if error state is not clean */
>>> + igt_assert(!check_error_state(fd));
>>
>> If you have a small enough threshold you don't need the dmesg check for
>> resets. IMO, ideally there would be a way to have the hang_detector
>> running but that doesn't work because it would disable watchdog as well.
>
> There will be no be capture. The stop_machine() employed there is the
> antithesis of light and fast reset. There will be no uevent since that
> is the global reset, so hang detector will not work. The way to detect
> the failure is to check the fence status which will report the reset of
> that particular request. inject_hang() needs to be reworked and
> gettimeofday() can be replaced by any of the igt timers.
What I was trying to say is that if we had the interface to say have
watchdog and global reset we could use the hang detector to fail the
test but, I guess the fence would make things simpler.
> context_set_watchdog uapi is very wrong. The assert that dt is less than
> 2*threshold is intriguing since the kernel patches should require 2
> evaluation intervals to detect a stuck seqno, and the arbitrary epsilon
> will be upset by the slightest sneeze, let alone scheduler misbehaviour
> and kasan.
Would making the process realtime solve the problem? I think you
originally were suggesting to use 3 batches with the hang sandwiched
between timestamps. I wonder if we could do it with only one batch like:
TIMESTAMP
TIMESTAMP <--+
LOOP --------+
But I am not sure if the second timestamp might get corrupted in the reset.
>
>>> +}
>>> +
>>> +igt_main
>>> +{
>>> + int fd = -1;
>>> +
>>> + igt_skip_on_simulation();
>
> This is definitely suitable for sim as we are testing control of hw
> features (which should help underline how far off dt asserts will be).
I think an allow_hang() and a feature based igt_require() in the fixture
below are needed.
>
>>> + igt_fixture {
>>> + fd = drm_open_driver(DRIVER_INTEL);
>>> + igt_require_gem(fd);
>>> + }
>>> +
>>> + for (const struct intel_execution_engine2 *e = intel_execution_engines2; e->name; e++) {
>>> + igt_subtest_group {
>>> + igt_fixture {
>>> + gem_require_engine(fd, e->class, e->instance);
>>
>> Move the require inside the subtest to maintain a consistent test list
>> across platforms.
>>
>> Thanks,
>> Antonio
>>
>>> + }
>>> +
>>> + /* default exec-id is purely symbolic */
>
> And this is purely garbage.
I think that was me, a long time ago...
Antonio
> -Chris
>
More information about the igt-dev
mailing list