[igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog

Carlos Santa carlos.santa at intel.com
Tue Jun 11 01:55:10 UTC 2019


On Wed, 2019-05-29 at 10:03 +0100, Chris Wilson wrote:
> Quoting Carlos Santa (2019-05-28 22:35:59)
> > This test adds a basic set of tests to reset the different
> > GPU engines through the GPU watchdog timer.
> > 
> > Credits to Antonio for the original codebase this is based on.
> > 
> > v2: remove gem_context_get_param() during set (Antonio)
> >     remove clearing of the engines_threshold[] in the default case
> >     inside context_set_watchdog(). (Antonio)
> >     fix indexing when creating low/high priority contexts
> >     get rid of 2 threads idea (Antonio)
> >     fix context prio bug due to wrong indexing (Antonio)
> > v3: no need to force clear drm_i915_gem_watchdog_timeout struct
> >     when setting the watchdog timeouts (Antonio)
> >     -use gem_context_set_priority() instead to use error checking
> >     -always use a looper to test all engines (Antonio)
> >     -modify test gpu_watchdog_hang_long_batch_single_engine()
> >     to use a couple of batches and set the timeout in between them
> > (Antonio)
> >     -make use of for_each_engine_class_instance() (Antonio)
> > v4: -tests should be able to handle s/w watchdog timeout (Tvrtko)
> >     -remove dead code inside batch_buffer factory (Tvrtko)
> >     -set_watchdog() should return ENODEV for fail cases (Tvrtko)
> >     -enclose name, priority and reset params inside array (Tvrtko)
> > v5: -Rebased. Tests 5, 6 and 8 from the test plan are now added
> > 
> > Test Plan:
> > 
> > Assumptions:
> > 1. Use fence status to figure out which ctx was reset
> > 2. Use spin batches w/ user space timers to control duration
> > and corking to control ordering
> > 3. Use context priorities to force preemption
> > 
> > Tests:
> > 1. ctx1, long_batch -> execute
> > 2. ctx1, set_watchdog -> reset
> > 3. ctx2/ctx1 -> execute, reset
> > 4. ctx1/ctx2 -> reset, execute
> > 5. ctx1_just_below_threshold -> execute
> > 6. ctx_over_the_threshold -> reset
> > 7. set watchdog on some engines -> execute
> > 8. submit long_batch and after half of the
> > expected runtime submit higher prio batch
> > 9. submit low prio batch w/o watchdog then
> > higher prio with watchdog
> > 
> > Unresolved items:
> > 1. The check for sync_fence_status(*fence)
> > returns -1 instead of EIO.
> > 
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > Cc: Antonio Argenziano <antonio.argenziano at intel.com>
> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
> > Signed-off-by: Carlos Santa <carlos.santa at intel.com>
> > ---
> >  tests/Makefile.sources    |   3 +
> >  tests/i915/gem_watchdog.c | 490 ++++++++++++++++++++++++++++++++++++++
> >  tests/meson.build         |   1 +
> >  3 files changed, 494 insertions(+)
> >  create mode 100644 tests/i915/gem_watchdog.c
> > 
> > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > index 2d5c929e32fc..65b868e7d927 100644
> > --- a/tests/Makefile.sources
> > +++ b/tests/Makefile.sources
> > @@ -446,6 +446,9 @@ gem_userptr_blits_SOURCES = i915/gem_userptr_blits.c
> >  TESTS_progs += gem_wait
> >  gem_wait_SOURCES = i915/gem_wait.c
> >  
> > +TESTS_progs += gem_watchdog
> > +gem_watchdog_SOURCES = i915/gem_watchdog.c
> > +
> >  TESTS_progs += gem_workarounds
> >  gem_workarounds_SOURCES = i915/gem_workarounds.c
> >  
> > diff --git a/tests/i915/gem_watchdog.c b/tests/i915/gem_watchdog.c
> > new file mode 100644
> > index 000000000000..e9cdb886d3d6
> > --- /dev/null
> > +++ b/tests/i915/gem_watchdog.c
> > @@ -0,0 +1,490 @@
> > +/*
> > + * Copyright © 2019 Intel Corporation
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a
> > + * copy of this software and associated documentation files (the "Software"),
> > + * to deal in the Software without restriction, including without limitation
> > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > + * and/or sell copies of the Software, and to permit persons to whom the
> > + * Software is furnished to do so, subject to the following conditions:
> > + *
> > + * The above copyright notice and this permission notice (including the next
> > + * paragraph) shall be included in all copies or substantial portions of the
> > + * Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> > + * IN THE SOFTWARE.
> > + */
> > +#include "igt.h"
> > +#include "igt_sysfs.h"
> > +#include "sw_sync.h"
> > +
> > +#include <pthread.h>
> > +#include <fcntl.h>
> > +
> > +#include <sys/time.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/poll.h>
> > +#include <sys/signal.h>
> > +#include "i915/gem_ring.h"
> > +
> > +#define LOCAL_I915_EXEC_BSD_SHIFT      (13)
> > +#define LOCAL_I915_EXEC_BSD_RING1      (1 << LOCAL_I915_EXEC_BSD_SHIFT)
> > +#define LOCAL_I915_EXEC_BSD_RING2      (2 << LOCAL_I915_EXEC_BSD_SHIFT)
> > +
> > +#define MAX_PRIO LOCAL_I915_CONTEXT_MAX_USER_PRIORITY
> > +#define DEFAULT_PRIO LOCAL_I915_CONTEXT_DEFAULT_PRIORITY
> > +#define MIN_PRIO LOCAL_I915_CONTEXT_MIN_USER_PRIORITY
> > +#define HIGH 1
> > +#define LOW 0
> > +#define WATCHDOG_OVER_THRESHOLD (2070000)
> > +#define WATCHDOG_BELOW_THRESHOLD (1200000)
> > +#define WATCHDOG_THRESHOLD (1200)
> > +#define MAX_ENGINES 5
> > +#define RENDER_CLASS 0
> > +#define VIDEO_DECODE_CLASS 1
> > +#define VIDEO_ENHANCEMENT_CLASS 2
> > +#define COPY_ENGINE_CLASS 3
> > +#define LOCAL_I915_CONTEXT_PARAM_WATCHDOG 0x10
> > +
> > +#define GET_RESET_STATS_IOCTL DRM_IOWR(DRM_COMMAND_BASE + 0x32, struct local_drm_i915_reset_stats)
> > +struct local_drm_i915_reset_stats {
> > +       __u32 ctx_id;
> > +       __u32 flags;
> > +       __u32 reset_count;
> > +       __u32 batch_active;
> > +       __u32 batch_pending;
> > +       __u32 pad;
> > +};
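
As an aside, the reset-stats ioctl above is declared but never used in
this version. A sketch of how it could back up the fence checks later
on (context_lost_batches() is a made-up helper name; this assumes
batch_active counts the batches lost while the context was active on
the engine):

    static unsigned int context_lost_batches(int fd, uint32_t ctx_id)
    {
            struct local_drm_i915_reset_stats rs = { .ctx_id = ctx_id };

            /* query the per-context reset statistics */
            do_ioctl(fd, GET_RESET_STATS_IOCTL, &rs);

            return rs.batch_active;
    }

    /* e.g. after the watchdog fires:
     *     igt_assert(context_lost_batches(fd, ctx[nengine]) > 0);
     */
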
> > +
> > +const uint64_t timeout_100ms = 100000000LL;
> > +float timedifference_msec(struct timeval t0, struct timeval t1);
> > +
> > +struct drm_i915_gem_watchdog_timeout {
> > +       union {
> > +               struct {
> > +                       /*
> > +                        * Engine class & instance to be configured or queried.
> > +                        */
> > +                       __u16 engine_class;
> > +                       __u16 engine_instance;
> > +               };
> > +               /* Index based addressing mode */
> > +               __u32 index;
> > +       };
> > +       /* GPU Engine watchdog resets timeout in us */
> > +       __u32 timeout_us;
> > +};
> > +
> > +static void clear_error_state(int fd)
> > +{
> > +       int dir;
> > +
> > +       dir = igt_sysfs_open(fd);
> > +
> > +       if (dir < 0)
> > +               return;
> > +
> > +       /* Any write to the error state clears it */
> > +       igt_sysfs_set(dir, "error", "");
> > +       close(dir);
> > +}
> > +
> > +static int context_set_watchdog(int fd, unsigned engine_id,
> > +                                const char *engine_name,
> > +                                unsigned ctx_id, unsigned threshold)
> > +{
> > +       struct drm_i915_gem_watchdog_timeout engines_threshold[MAX_ENGINES];
> > +       struct drm_i915_gem_context_param arg = {
> > +               .param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
> > +               .ctx_id = ctx_id,
> > +               .size = sizeof(engines_threshold),
> 
> Won't the kernel complain about the invalid values?
> 
> > +               .value = (uint64_t)&engines_threshold
> > +       };
> > +
> > +       memset(&engines_threshold, 0, sizeof(engines_threshold));
> 
> You set all the unspecified values to be rcs0 with 0 timeout.
> 
> > +
> > +       switch (engine_id & I915_EXEC_RING_MASK) {
> > +       case I915_EXEC_RENDER:
> > +               engines_threshold[0].timeout_us = threshold;
> 
> I presume this is a u64, so might as well use _ns for consistency.
> 
> > +               engines_threshold[0].engine_class = RENDER_CLASS;
> > +               engines_threshold[0].engine_instance = 0;
> 
> Use struct i915_engine_class_instance.
> 
> > +               break;
> > +       case I915_EXEC_BLT:
> > +               if (__gem_context_get_param(fd, &arg) == -ENODEV)
> > +                       return -ENODEV;
> 
> And otherwise setting random values into engine_thresholds.
> 
> > +               else
> > +                       engines_threshold[3].timeout_us = threshold;
> > +                       engines_threshold[3].engine_class = COPY_ENGINE_CLASS;
> > +                       engines_threshold[3].engine_instance = 0;
> > +               break;
> > +       case I915_EXEC_BSD:
> > +               engines_threshold[1].timeout_us = threshold;
> > +               engines_threshold[1].engine_class = VIDEO_DECODE_CLASS;
> > +               engines_threshold[1].engine_instance = 0;
> > +               break;
> > +       case I915_EXEC_VEBOX:
> > +               engines_threshold[2].timeout_us = threshold;
> > +               engines_threshold[2].engine_class = VIDEO_ENHANCEMENT_CLASS;
> > +               engines_threshold[2].engine_instance = 0;
> > +               break;
> > +       default:
> > +               break;
> > +       }
> 
> Ok, so you only set one engine and the rest to zero. That seems like a
> bad example in how to use the uAPI.
> 
> > +       gem_context_set_param(fd, &arg);
> > +
> > +       return 0;
> > +}
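
On the two points above (use struct i915_engine_class_instance, and
don't leave the remaining slots as rcs0 with a zero timeout), a rough
sketch of filling the whole array for every physical engine; the helper
name is made up, and it keeps the drm_i915_gem_watchdog_timeout layout
and LOCAL_I915_CONTEXT_PARAM_WATCHDOG param from this patch:

    static int context_set_watchdog_all(int fd, uint32_t ctx_id,
                                        unsigned int timeout_us)
    {
            struct drm_i915_gem_watchdog_timeout thresholds[MAX_ENGINES] = {};
            struct drm_i915_gem_context_param arg = {
                    .param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
                    .ctx_id = ctx_id,
            };
            struct intel_execution_engine2 *e;
            unsigned int n = 0;

            /* one entry per physical engine, addressed by class:instance */
            __for_each_physical_engine(fd, e) {
                    igt_assert(n < MAX_ENGINES);
                    thresholds[n].engine_class = e->class;
                    thresholds[n].engine_instance = e->instance;
                    thresholds[n].timeout_us = timeout_us;
                    n++;
            }

            arg.size = n * sizeof(*thresholds);
            arg.value = to_user_pointer(thresholds);

            /* let the caller decide how to treat lack of kernel support */
            return __gem_context_set_param(fd, &arg);
    }
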
> > +
> > +static void batch_buffer_factory(uint32_t fd, uint32_t ctx_id, unsigned exec_id,
> > +                                uint32_t target, uint32_t offset,
> > +                                uint32_t *handle, uint64_t timeout,
> > +                                int *fence, int fence_index)
> > +{
> > +    struct drm_i915_gem_exec_object2 obj[2];
> > +    struct drm_i915_gem_relocation_entry reloc;
> > +    struct drm_i915_gem_execbuffer2 execbuf;
> > +    igt_spin_t *spin = NULL;
> > +    const uint32_t bbe = MI_BATCH_BUFFER_END;
> > +    int i = 0;
> > +
> > +    gem_quiescent_gpu(fd);
> > +
> > +    memset(&execbuf, 0, sizeof(execbuf));
> > +    memset(&obj, 0, sizeof(obj));
> > +    memset(&reloc, 0, sizeof(reloc));
> > +
> > +    execbuf.buffers_ptr = to_user_pointer(obj);
> > +
> > +    execbuf.buffer_count = 2;
> > +    execbuf.flags = exec_id | I915_EXEC_FENCE_OUT ;
> > +
> > +    obj[0].handle = target;
> > +    obj[1].handle = gem_create(fd, 4096);
> > +
> > +    obj[1].relocation_count = 1;
> > +    obj[1].relocs_ptr = to_user_pointer(&reloc);
> > +
> > +    reloc.target_handle = obj[0].handle;
> > +    reloc.read_domains = I915_GEM_DOMAIN_COMMAND;
> > +    reloc.write_domain = I915_GEM_DOMAIN_COMMAND;
> > +    reloc.delta = offset * sizeof(uint32_t);
> > +
> > +    reloc.offset = i * sizeof(uint32_t);
> > +    gem_write(fd, obj[1].handle, 0, &bbe, sizeof(bbe));
> > +
> > +    __sync_synchronize();
> 
> What?
> 
> > +
> > +    gem_sync(fd, obj[1].handle);
> > +    execbuf.rsvd1 = ctx_id;
> > +    execbuf.rsvd2 = -1;
> > +
> > +    spin = igt_spin_new(fd, .dependency = obj[0].handle);
> > +    igt_spin_set_timeout(spin, timeout);
> > +    igt_assert(gem_bo_busy(fd, obj[0].handle));
> > +
> > +    gem_execbuf_wr(fd, &execbuf);
> > +    igt_spin_free(fd, spin);
> 
> This still doesn't do what you think it does. Hint, this never runs to
> timeout.
> 
> > +    fence[fence_index] = execbuf.rsvd2 >> 32;
> 
> You can just ask the spinner to return a fence.
> spin = igt_spin_new(fd, .engine = engine, .flags = IGT_SPIN_FENCE_OUT);
> then fence = spin->out_fence.

Hi Chris,

Can you help me understand the above a little bit more?

The spinner looks like this now:

    spin = igt_spin_new(fd,
			.dependency = obj[0].handle,
			.engine = exec_id,
			.flags = IGT_SPIN_FENCE_OUT);
    igt_assert(spin->out_fence != -1);
    igt_spin_set_timeout(spin, timeout);
    igt_assert(gem_bo_busy(fd, obj[0].handle));

And then I store the fence as:

fence[fence_index] = spin->out_fence;

If I check the fence and its status at this point (right after creating
the batch buffer work), I get fence: 5 and status: 0:

printf("fence: %d status: %d\n", spin->out_fence, sync_fence_status(spin->out_fence));

The 0, as far as I understand, means the fence is still active, which
seems correct.

However, some time later, after I force a reset of the engines through
the watchdog, the fence status comes back as -9.

Why am I not getting -5 (-EIO) after the reset?
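
For reference, the check I would eventually like to make, once the
reset is reported through the fence, is roughly (sketch):

    /* wait for the fence to resolve, then expect the reset error */
    igt_assert_eq(sync_fence_wait(spin->out_fence, -1), 0);
    igt_assert_eq(sync_fence_status(spin->out_fence), -EIO);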

Carlos 

> 
> > +    gem_close(fd, obj[1].handle);
> > +    gem_quiescent_gpu(fd);
> > +}
> > +
> > +static void inject_hang(uint32_t fd, unsigned engine_id, uint32_t ctx_id,
> > +                        unsigned flags)
> > +{
> > +       igt_hang_t hang;
> > +       hang = igt_hang_ctx(fd, ctx_id, engine_id, flags);
> > +       gem_sync(fd, hang.spin->handle);
> > +}
> > +
> > +/* Test#1: create some work and let it run on all engines */
> > +static void long_batch_test1(int fd, int prio1, int prio2, int reset_ctx1,
> > +                       int reset_ctx2, unsigned threshold)
> > +{
> > +       unsigned engine_id = 0;
> > +       unsigned nengine = 0;
> > +       struct intel_execution_engine2 *e_;
> > +       uint32_t ctx[8];
> > +       uint32_t scratch[8];
> > +       uint32_t ctx2[8];
> > +       uint32_t scratch2[8];
> 
> A struct might spell it out a little more concisely.
> 
> And 8! That's cutting it a little fine. MAX_ENGINES is 64 and we are
> already dreading the day we exceed that.
> 
> > +       int *fence, i = 0;
> > +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> > +
> > +       fence = (int *)malloc(sizeof(int)*8);
> 
> Please be kind to readers. The cast is implicit, and spaces are
> required around '*'.
> 
> > +       igt_assert(fence);
> > +
> > +       __for_each_physical_engine(fd, e_) {
> > +               engine_id = e_->flags;
> > +
> > +               scratch2[nengine] = gem_create(fd, 4096);
> > +               ctx2[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> > +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> > +                                scratch2[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               scratch[nengine] = gem_create(fd, 4096);
> > +               ctx[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx[nengine], prio1);
> > +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> > +                                scratch[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +               igt_info("Test #1: ctx1/ctx2 --> execute on %s\n", e_->name);
> > +               nengine++;
> 
> So that doesn't test anything at all.
> 
> > +       }
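
Re "doesn't test anything": one minimal check Test #1 could make, as a
sketch reusing the fences stored by batch_buffer_factory, would be to
assert that every batch completed cleanly before the teardown below:

    for (i = 0; i < nengine; i++) {
            /* no watchdog armed here, so each fence should signal success (1) */
            igt_assert_eq(sync_fence_wait(fence[i], -1), 0);
            igt_assert_eq(sync_fence_status(fence[i]), 1);
    }
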
> > +
> > +       for (i = 0; i < nengine; i++) {
> > +               close(fence[i]);
> > +               gem_context_destroy(fd, ctx[i]);
> > +               gem_context_destroy(fd, ctx2[i]);
> > +               gem_close(fd, scratch[i]);
> > +               gem_close(fd, scratch2[i]);
> > +       }
> > +}
> > +
> > +float timedifference_msec(struct timeval t0, struct timeval t1)
> > +{
> > +    return (t1.tv_sec - t0.tv_sec) * 1000.0f + (t1.tv_usec - t0.tv_usec) / 1000.0f;
> > +}
> > +
> > +/* Test#2: submit a long batch and after half of the expected runtime
> > +   submit a higher priority batch and then try to cancel the execution */
> > +static void long_batch_test2(int fd, int prio1, int prio2, int reset_ctx1,
> > +                       int reset_ctx2, unsigned threshold)
> > +{
> > +       unsigned engine_id = 0;
> > +       unsigned nengine = 0;
> > +       struct intel_execution_engine2 *e_;
> > +       uint32_t ctx[8];
> > +       uint32_t scratch[8];
> > +       uint32_t ctx2[8];
> > +       uint32_t scratch2[8];
> > +       int *fence, i = 0;
> > +       unsigned flags = HANG_ALLOW_CAPTURE;
> > +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> > +       struct timeval t0;
> > +       struct timeval t1;
> > +       float elapsed;
> > +
> > +       fence = (int *)malloc(sizeof(int)*8);
> > +       igt_assert(fence);
> > +
> > +       nengine = 0;
> > +
> > +       __for_each_physical_engine(fd, e_) {
> > +               engine_id = e_->flags;
> > +
> > +               scratch2[nengine] = gem_create(fd, 4096);
> > +               ctx2[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> > +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> > +                                scratch2[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               gettimeofday(&t0, 0);
> > +               while(elapsed < 200) {
> > +                       gettimeofday(&t1, 0);
> > +                       elapsed += timedifference_msec(t0, t1);
> > +               }
> 
> See usleep() or nanosleep()
> 
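
For example, the busy-wait above could simply become:

    usleep(200 * 1000); /* back off ~200ms before submitting the high-prio batch */
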
> > +               scratch[nengine] = gem_create(fd, 4096);
> > +               ctx[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx[nengine], prio1);
> > +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> > +                                scratch[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               if (context_set_watchdog(fd, engine_id, e_->name,
> > +                                        ctx[nengine],
> > +                                        threshold) == -ENODEV) {
> > +                       igt_info("No support for gpu h/w watchdog on %s\n",
> > +                        e_->name);
> > +                       goto skip_case;
> > +               }
> > +               clear_error_state(fd);
> > +               inject_hang(fd, engine_id, ctx[nengine], flags);
> > +               /* Now check the engine was reset successfully */
> > +               //igt_assert_eq(sync_fence_status(*fence), -EIO);
> 
> Yeah, that fence has nothing to do with anything.
> 
> > +               igt_info("Test #2 ctx1/ctx2 --> set watchdog and"
> > +                               " cancel ctx2 at half expected run with higher"
> > +                               " priority engine: %s, fence status: 0x%x \n",
> > +                                       e_->name, sync_fence_status(*fence));
> 
> You've set a watchdog, but don't validate that it had any effect.
> 
> May I suggest using a timing batch: write the initial timestamp to
> scratch, then write a second timestamp and recurse back to writing the
> second timestamp. After that batch is forcibly reset, you can compare
> the two timestamps to see how long it executed for (without having to
> dig into the context image to extract the ppHWSP runtimes). The
> contract given by the watchdog uAPI would seem to be that the batch is
> hung after its runtime exceeds the timeout threshold (though how
> strictly we fail for exceeding the timeout???) and most definitely not
> reset before we reach the threshold (give or take a few microseconds
> required for the recursion and context switching/execution that will
> be charged to the watchdog and not recorded by the batch).
> -Chris
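
For what it's worth, the check that such a timing batch would enable
could look roughly like this; ts_to_us() (the platform-specific
command-streamer timestamp conversion), the scratch layout and the
tolerance are assumptions for the sketch, not existing helpers:

    static void check_watchdog_runtime(int fd, uint32_t scratch,
                                       uint64_t threshold_us,
                                       uint64_t tolerance_us)
    {
            uint32_t ts[2];
            uint64_t runtime_us;

            /* ts[0]: timestamp written at the start of the batch,
             * ts[1]: last timestamp the loop wrote before the reset */
            gem_read(fd, scratch, 0, ts, sizeof(ts));

            runtime_us = ts_to_us(ts[1] - ts[0]);

            /* must not be reset (much) before the threshold... */
            igt_assert(runtime_us + tolerance_us >= threshold_us);
            /* ...and must be hung soon after exceeding it */
            igt_assert(runtime_us <= threshold_us + tolerance_us);
    }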


