[Intel-gfx] [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
Carlos Santa
carlos.santa at intel.com
Wed Mar 27 01:58:19 UTC 2019
On Mon, 2019-03-25 at 10:00 +0000, Tvrtko Ursulin wrote:
> On 22/03/2019 23:41, Carlos Santa wrote:
> > From: Michel Thierry <michel.thierry at intel.com>
> >
> > *** General ***
> >
> > Watchdog timeout (or "media engine reset") is a feature that allows
> > userland applications to enable hang detection on individual batch
> > buffers.
> > The detection mechanism itself is mostly bound to the hardware and
> > the only
> > thing that the driver needs to do to support this form of hang
> > detection
> > is to implement the interrupt handling support as well as watchdog
> > command
> > emission before and after the emitted batch buffer start
> > instruction in the
> > ring buffer.
> >
> > The principle of the hang detection mechanism is as follows:
> >
> > 1. Once the decision has been made to enable watchdog timeout for a
> > particular batch buffer and the driver is in the process of
> > emitting the
> > batch buffer start instruction into the ring buffer it also emits a
> > watchdog timer start instruction before and a watchdog timer
> > cancellation
> > instruction after the batch buffer start instruction in the ring
> > buffer.
> >
> > 2. Once the GPU execution reaches the watchdog timer start
> > instruction
> > the hardware watchdog counter is started by the hardware. The
> > counter
> > keeps counting until either reaching a previously configured
> > threshold
> > value or the timer cancellation instruction is executed.
> >
> > 2a. If the counter reaches the threshold value the hardware fires a
> > watchdog interrupt that is picked up by the watchdog interrupt
> > handler.
> > This means that a hang has been detected and the driver needs to
> > deal with
> > it the same way it would deal with a engine hang detected by the
> > periodic
> > hang checker. The only difference between the two is that we
> > already blamed
> > the active request (to ensure an engine reset).
> >
> > 2b. If the batch buffer completes and the execution reaches the
> > watchdog
> > cancellation instruction before the watchdog counter reaches its
> > threshold value the watchdog is cancelled and nothing more comes of
> > it.
> > No hang is detected.
> >
> > Note about future interaction with preemption: Preemption could
> > happen
> > in a command sequence prior to watchdog counter getting disabled,
> > resulting in watchdog being triggered following preemption (e.g.
> > when
> > watchdog had been enabled in the low priority batch). The driver
> > will
> > need to explicitly disable the watchdog counter as part of the
> > preemption sequence.
> >
> > *** This patch introduces: ***
> >
> > 1. IRQ handler code for watchdog timeout allowing direct hang
> > recovery
> > based on hardware-driven hang detection, which then integrates
> > directly
> > with the hang recovery path. This is independent of having per-
> > engine reset
> > or just full gpu reset.
> >
> > 2. Watchdog specific register information.
> >
> > Currently the render engine and all available media engines support
> > watchdog timeout (VECS is only supported in GEN9). The
> > specifications elude
> > to the BCS engine being supported but that is currently not
> > supported by
> > this commit.
> >
> > Note that the value to stop the counter is different between render
> > and
> > non-render engines in GEN8; GEN9 onwards it's the same.
> >
> > v2: Move irq handler to tasklet, arm watchdog for a 2nd time to
> > check
> > against false-positives.
> >
> > v3: Don't use high priority tasklet, use engine_last_submit while
> > checking for false-positives. From GEN9 onwards, the stop counter
> > bit is
> > the same for all engines.
> >
> > v4: Remove unnecessary brackets, use current_seqno to mark the
> > request
> > as guilty in the hangcheck/capture code.
> >
> > v5: Rebased after RESET_ENGINEs flag.
> >
> > v6: Don't capture error state in case of watchdog timeout. The
> > capture
> > process is time consuming and this will align to what happens when
> > we
> > use GuC to handle the watchdog timeout. (Chris)
> >
> > v7: Rebase.
> >
> > v8: Rebase, use HZ to reschedule.
> >
> > v9: Rebase, get forcewake domains in function (no longer in
> > execlists
> > struct).
> >
> > v10: Rebase.
> >
> > v11: Rebase,
> > remove extra braces (Tvrtko),
> > implement watchdog_to_clock_counts helper (Tvrtko),
> > Move tasklet_kill(watchdog_tasklet) inside intel_engines
> > (Tvrtko),
> > Use a global heartbeat seqno instead of engine seqno (Chris)
> > Make all engines checks all class based checks (Tvrtko)
> >
> > v12: Rebase,
> > Reset immediately upon entering the IRQ (Chris)
> > Make reset_engine_to_str a helper (Tvrtko)
> > Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko)
> > Let the compiler itself do the inline (Tvrtko)
> >
> > v13: Rebase
> > v14: Rebase, skip checking for the guilty seqno in the tasklet
> > (Tvrtko)
>
> IIRC I only asked about it so I guess you tried it and it works well?
>
> Can you also post the IGTs so we can see the test coverage?
>
> Regards,
>
> Tvrtko
Yeah, the unit test works but it's still the very simple one (updated
for the uAPI mods) I started with last August, I still need to add more
to it though...
https://lists.freedesktop.org/archives/igt-dev/2018-September/005834.html
Right now, I am on the hrtimer workaround to see how that works, will
get back to the test coverage after that...
Regards,
Carlos
>
> >
> > Cc: Antonio Argenziano <antonio.argenziano at intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> > Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa at intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_gpu_error.h | 4 ++
> > drivers/gpu/drm/i915/i915_irq.c | 14 ++++--
> > drivers/gpu/drm/i915/i915_reg.h | 6 +++
> > drivers/gpu/drm/i915/i915_reset.c | 20 +++++++++
> > drivers/gpu/drm/i915/i915_reset.h | 6 +++
> > drivers/gpu/drm/i915/intel_engine_cs.c | 1 +
> > drivers/gpu/drm/i915/intel_engine_types.h | 5 +++
> > drivers/gpu/drm/i915/intel_hangcheck.c | 11 +----
> > drivers/gpu/drm/i915/intel_lrc.c | 52
> > +++++++++++++++++++++++
> > 9 files changed, 107 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h
> > b/drivers/gpu/drm/i915/i915_gpu_error.h
> > index 99d6b7b270c2..6cf6a8679b26 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> > @@ -203,6 +203,9 @@ struct i915_gpu_error {
> > * any global resources that may be clobber by the reset (such
> > as
> > * FENCE registers).
> > *
> > + * #I915_RESET_WATCHDOG - When hw detects a hang before us, we
> > can use
> > + * I915_RESET_WATCHDOG to report the hang detection cause
> > accurately.
> > + *
> > * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't
> > need to
> > * acquire the struct_mutex to reset an engine, we need an
> > explicit
> > * flag to prevent two concurrent reset attempts in the same
> > engine.
> > @@ -218,6 +221,7 @@ struct i915_gpu_error {
> > #define I915_RESET_BACKOFF 0
> > #define I915_RESET_MODESET 1
> > #define I915_RESET_ENGINE 2
> > +#define I915_RESET_WATCHDOG 3
> > #define I915_WEDGED (BITS_PER_LONG - 1)
> >
> > /** Number of times the device has been reset (global) */
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c
> > b/drivers/gpu/drm/i915/i915_irq.c
> > index 82d487189a34..e64994be25c3 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1466,6 +1466,9 @@ gen8_cs_irq_handler(struct intel_engine_cs
> > *engine, u32 iir)
> >
> > if (tasklet)
> > tasklet_hi_schedule(&engine->execlists.tasklet);
> > +
> > + if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> > + tasklet_schedule(&engine->execlists.watchdog_tasklet);
> > }
> >
> > static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -3892,20 +3895,25 @@ static void gen8_gt_irq_postinstall(struct
> > drm_i915_private *dev_priv)
> > u32 gt_interrupts[] = {
> > (GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> > + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> > GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
> > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT),
> > -
> > (GT_RENDER_USER_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> > + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> > GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> > - GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
> > -
> > + GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> > + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
> > 0,
> >
> > (GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
> > GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT)
> > };
> >
> > + /* VECS watchdog is only available in skl+ */
> > + if (INTEL_GEN(dev_priv) >= 9)
> > + gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> > +
> > dev_priv->pm_ier = 0x0;
> > dev_priv->pm_imr = ~dev_priv->pm_ier;
> > GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> > diff --git a/drivers/gpu/drm/i915/i915_reg.h
> > b/drivers/gpu/drm/i915/i915_reg.h
> > index 9b69cec21f7b..ac8d984e16ae 100644
> > --- a/drivers/gpu/drm/i915/i915_reg.h
> > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > @@ -2363,6 +2363,11 @@ enum i915_power_well_id {
> > #define RING_START(base) _MMIO((base) + 0x38)
> > #define RING_CTL(base) _MMIO((base) + 0x3c)
> > #define RING_CTL_SIZE(size) ((size) - PAGE_SIZE) /* in
> > bytes -> pages */
> > +#define RING_CNTR(base) _MMIO((base) + 0x178)
> > +#define GEN8_WATCHDOG_ENABLE 0
> > +#define GEN8_WATCHDOG_DISABLE 1
> > +#define GEN8_XCS_WATCHDOG_DISABLE 0xFFFFFFFF /* GEN8 & non-render
> > only */
> > +#define RING_THRESH(base) _MMIO((base) + 0x17C)
> > #define RING_SYNC_0(base) _MMIO((base) + 0x40)
> > #define RING_SYNC_1(base) _MMIO((base) + 0x44)
> > #define RING_SYNC_2(base) _MMIO((base) + 0x48)
> > @@ -2925,6 +2930,7 @@ enum i915_power_well_id {
> > #define GT_BSD_USER_INTERRUPT (1 << 12)
> > #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1 (1 << 11) /*
> > hsw+; rsvd on snb, ivb, vlv */
> > #define GT_CONTEXT_SWITCH_INTERRUPT (1 << 8)
> > +#define GT_GEN8_WATCHDOG_INTERRUPT (1 << 6) /* gen8+ */
> > #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT (1 << 5) /*
> > !snb */
> > #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT (1 << 4)
> > #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT (1 << 3)
> > diff --git a/drivers/gpu/drm/i915/i915_reset.c
> > b/drivers/gpu/drm/i915/i915_reset.c
> > index 861fe083e383..739fa5ad1a8d 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.c
> > +++ b/drivers/gpu/drm/i915/i915_reset.c
> > @@ -1208,6 +1208,26 @@ void i915_clear_error_registers(struct
> > drm_i915_private *dev_priv)
> > }
> > }
> >
> > +void engine_reset_error_to_str(struct drm_i915_private *i915,
> > + char *msg,
> > + size_t sz,
> > + unsigned int hung,
> > + unsigned int stuck,
> > + unsigned int watchdog)
> > +{
> > + int len;
> > + unsigned int tmp;
> > + struct intel_engine_cs *engine;
> > +
> > + len = scnprintf(msg, sz,
> > + "%s on ", watchdog ? "watchdog timeout" :
> > + stuck == hung ? "no_progress" :
> > "hang");
> > + for_each_engine_masked(engine, i915, hung, tmp)
> > + len += scnprintf(msg + len, sz - len,
> > + "%s, ", engine->name);
> > + msg[len-2] = '\0';
> > +}
> > +
> > /**
> > * i915_handle_error - handle a gpu error
> > * @i915: i915 device private
> > diff --git a/drivers/gpu/drm/i915/i915_reset.h
> > b/drivers/gpu/drm/i915/i915_reset.h
> > index 16f2389f656f..8582d1242248 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.h
> > +++ b/drivers/gpu/drm/i915/i915_reset.h
> > @@ -20,6 +20,12 @@ void i915_handle_error(struct drm_i915_private
> > *i915,
> > u32 engine_mask,
> > unsigned long flags,
> > const char *fmt, ...);
> > +void engine_reset_error_to_str(struct drm_i915_private *i915,
> > + char *str,
> > + size_t sz,
> > + unsigned int hung,
> > + unsigned int stuck,
> > + unsigned int watchdog);
> > #define I915_ERROR_CAPTURE BIT(0)
> >
> > void i915_clear_error_registers(struct drm_i915_private *i915);
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c
> > b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 652c1b3ba190..88cf0fc07623 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1149,6 +1149,7 @@ void intel_engines_park(struct
> > drm_i915_private *i915)
> > /* Flush the residual irq tasklets first. */
> > intel_engine_disarm_breadcrumbs(engine);
> > tasklet_kill(&engine->execlists.tasklet);
> > + tasklet_kill(&engine->execlists.watchdog_tasklet);
> >
> > /*
> > * We are committed now to parking the engines, make
> > sure there
> > diff --git a/drivers/gpu/drm/i915/intel_engine_types.h
> > b/drivers/gpu/drm/i915/intel_engine_types.h
> > index b0aa1f0d4e47..c4f66b774e7c 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_types.h
> > +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> > @@ -124,6 +124,11 @@ struct intel_engine_execlists {
> > */
> > struct tasklet_struct tasklet;
> >
> > + /*
> > + * @watchdog_tasklet: stop counter and reschedule
> > hangcheck_work asap
> > + */
> > + struct tasklet_struct watchdog_tasklet;
> > +
> > /**
> > * @default_priolist: priority list for I915_PRIORITY_NORMAL
> > */
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c
> > b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index 57ed49dc19c4..4bf26863678c 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -220,22 +220,15 @@ static void hangcheck_declare_hang(struct
> > drm_i915_private *i915,
> > unsigned int hung,
> > unsigned int stuck)
> > {
> > - struct intel_engine_cs *engine;
> > char msg[80];
> > - unsigned int tmp;
> > - int len;
> > + size_t len = sizeof(msg);
> >
> > /* If some rings hung but others were still busy, only
> > * blame the hanging rings in the synopsis.
> > */
> > if (stuck != hung)
> > hung &= ~stuck;
> > - len = scnprintf(msg, sizeof(msg),
> > - "%s on ", stuck == hung ? "no progress" :
> > "hang");
> > - for_each_engine_masked(engine, i915, hung, tmp)
> > - len += scnprintf(msg + len, sizeof(msg) - len,
> > - "%s, ", engine->name);
> > - msg[len-2] = '\0';
> > + engine_reset_error_to_str(i915, msg, len, hung, stuck, 0);
> >
> > return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s",
> > msg);
> > }
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > b/drivers/gpu/drm/i915/intel_lrc.c
> > index e54e0064b2d6..85785a94f6ae 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -2195,6 +2195,40 @@ static int gen8_emit_flush_render(struct
> > i915_request *request,
> > return 0;
> > }
> >
> > +static void gen8_watchdog_tasklet(unsigned long data)
> > +{
> > + struct intel_engine_cs *engine = (struct intel_engine_cs
> > *)data;
> > + struct drm_i915_private *dev_priv = engine->i915;
> > + enum forcewake_domains fw_domains;
> > + char msg[80];
> > + size_t len = sizeof(msg);
> > + unsigned long *lock = &engine->i915->gpu_error.flags;
> > + unsigned int bit = I915_RESET_ENGINE + engine->id;
> > +
> > + switch (engine->class) {
> > + default:
> > + MISSING_CASE(engine->id);
> > + /* fall through */
> > + case RENDER_CLASS:
> > + fw_domains = FORCEWAKE_RENDER;
> > + break;
> > + case VIDEO_DECODE_CLASS:
> > + case VIDEO_ENHANCEMENT_CLASS:
> > + fw_domains = FORCEWAKE_MEDIA;
> > + break;
> > + }
> > +
> > + intel_uncore_forcewake_get(dev_priv, fw_domains);
> > +
> > + if (!test_and_set_bit(bit, lock)) {
> > + unsigned int hung = engine->mask;
> > + engine_reset_error_to_str(dev_priv, msg, len, hung, 0,
> > 1);
> > + i915_reset_engine(engine, msg);
> > + clear_bit(bit, lock);
> > + wake_up_bit(lock, bit);
> > + }
> > +}
> > +
> > /*
> > * Reserve space for 2 NOOPs at the end of each request to be
> > * used as a workaround for not being allowed to do lite
> > @@ -2377,6 +2411,21 @@ logical_ring_default_irqs(struct
> > intel_engine_cs *engine)
> >
> > engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
> > engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> > +
> > + switch (engine->class) {
> > + default:
> > + /* BCS engine does not support hw watchdog */
> > + break;
> > + case RENDER_CLASS:
> > + case VIDEO_DECODE_CLASS:
> > + engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT <<
> > shift;
> > + break;
> > + case VIDEO_ENHANCEMENT_CLASS:
> > + if (INTEL_GEN(engine->i915) >= 9)
> > + engine->irq_keep_mask |=
> > + GT_GEN8_WATCHDOG_INTERRUPT << shift;
> > + break;
> > + }
> > }
> >
> > static int
> > @@ -2394,6 +2443,9 @@ logical_ring_setup(struct intel_engine_cs
> > *engine)
> > tasklet_init(&engine->execlists.tasklet,
> > execlists_submission_tasklet, (unsigned
> > long)engine);
> >
> > + tasklet_init(&engine->execlists.watchdog_tasklet,
> > + gen8_watchdog_tasklet, (unsigned long)engine);
> > +
> > logical_ring_default_vfuncs(engine);
> > logical_ring_default_irqs(engine);
> >
> >
More information about the Intel-gfx
mailing list