[Intel-gfx] [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Thu Feb 28 17:38:11 UTC 2019
On 21/02/2019 02:58, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry at intel.com>
>
> *** General ***
>
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
>
> The principle of the hang detection mechanism is as follows:
>
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
>
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
>
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with a engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
>
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
>
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption (e.g. when
> watchdog had been enabled in the low priority batch). The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
>
> *** This patch introduces: ***
>
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
>
> 2. Watchdog specific register information.
>
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications elude
> to the BCS engine being supported but that is currently not supported by
> this commit.
>
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.
>
> v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> against false-positives.
>
> v3: Don't use high priority tasklet, use engine_last_submit while
> checking for false-positives. From GEN9 onwards, the stop counter bit is
> the same for all engines.
>
> v4: Remove unnecessary brackets, use current_seqno to mark the request
> as guilty in the hangcheck/capture code.
>
> v5: Rebased after RESET_ENGINEs flag.
>
> v6: Don't capture error state in case of watchdog timeout. The capture
> process is time consuming and this will align to what happens when we
> use GuC to handle the watchdog timeout. (Chris)
>
> v7: Rebase.
>
> v8: Rebase, use HZ to reschedule.
>
> v9: Rebase, get forcewake domains in function (no longer in execlists
> struct).
>
> v10: Rebase.
>
> v11: Rebase,
> remove extra braces (Tvrtko),
> implement watchdog_to_clock_counts helper (Tvrtko),
> Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
> Use a global heartbeat seqno instead of engine seqno (Chris)
> Make all engines checks all class based checks (Tvrtko)
>
> Cc: Antonio Argenziano <antonio.argenziano at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> Signed-off-by: Carlos Santa <carlos.santa at intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.h | 8 +++
> drivers/gpu/drm/i915/i915_gpu_error.h | 4 ++
> drivers/gpu/drm/i915/i915_irq.c | 12 ++++-
> drivers/gpu/drm/i915/i915_reg.h | 6 +++
> drivers/gpu/drm/i915/intel_engine_cs.c | 1 +
> drivers/gpu/drm/i915/intel_hangcheck.c | 17 +++++--
> drivers/gpu/drm/i915/intel_lrc.c | 65 +++++++++++++++++++++++++
> drivers/gpu/drm/i915/intel_ringbuffer.h | 7 +++
> 8 files changed, 114 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 63a008aebfcd..0fcb2df869a2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3120,6 +3120,14 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
> return ctx;
> }
>
> +static inline u32
> +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
> +{
> + u64 threshold = 0;
> +
> + return threshold;
> +}
> +
> int i915_perf_open_ioctl(struct drm_device *dev, void *data,
> struct drm_file *file);
> int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index f408060e0667..bd1821c73ecd 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -233,6 +233,9 @@ struct i915_gpu_error {
> * i915_mutex_lock_interruptible()?). I915_RESET_BACKOFF serves a
> * secondary role in preventing two concurrent global reset attempts.
> *
> + * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
> + * I915_RESET_WATCHDOG to report the hang detection cause accurately.
> + *
> * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
> * acquire the struct_mutex to reset an engine, we need an explicit
> * flag to prevent two concurrent reset attempts in the same engine.
> @@ -248,6 +251,7 @@ struct i915_gpu_error {
> #define I915_RESET_BACKOFF 0
> #define I915_RESET_MODESET 1
> #define I915_RESET_ENGINE 2
> +#define I915_RESET_WATCHDOG 3
> #define I915_WEDGED (BITS_PER_LONG - 1)
>
> /** Number of times an engine has been reset */
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 4b23b2fd1fad..e2a1a07b0f2c 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1456,6 +1456,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>
> if (tasklet)
> tasklet_hi_schedule(&engine->execlists.tasklet);
> +
> + if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> + tasklet_schedule(&engine->execlists.watchdog_tasklet);
> }
>
> static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> @@ -3883,17 +3886,24 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
> u32 gt_interrupts[] = {
> GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
> GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
> GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> - GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
> + GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
> 0,
> GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
> GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
> };
>
> + /* VECS watchdog is only available in skl+ */
> + if (INTEL_GEN(dev_priv) >= 9)
> + gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> +
> dev_priv->pm_ier = 0x0;
> dev_priv->pm_imr = ~dev_priv->pm_ier;
> GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 1eca166d95bb..a0e101bbcbce 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2335,6 +2335,11 @@ enum i915_power_well_id {
> #define RING_START(base) _MMIO((base) + 0x38)
> #define RING_CTL(base) _MMIO((base) + 0x3c)
> #define RING_CTL_SIZE(size) ((size) - PAGE_SIZE) /* in bytes -> pages */
> +#define RING_CNTR(base) _MMIO((base) + 0x178)
> +#define GEN8_WATCHDOG_ENABLE 0
> +#define GEN8_WATCHDOG_DISABLE 1
> +#define GEN8_XCS_WATCHDOG_DISABLE 0xFFFFFFFF /* GEN8 & non-render only */
> +#define RING_THRESH(base) _MMIO((base) + 0x17C)
> #define RING_SYNC_0(base) _MMIO((base) + 0x40)
> #define RING_SYNC_1(base) _MMIO((base) + 0x44)
> #define RING_SYNC_2(base) _MMIO((base) + 0x48)
> @@ -2894,6 +2899,7 @@ enum i915_power_well_id {
> #define GT_BSD_USER_INTERRUPT (1 << 12)
> #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1 (1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
> #define GT_CONTEXT_SWITCH_INTERRUPT (1 << 8)
> +#define GT_GEN8_WATCHDOG_INTERRUPT (1 << 6) /* gen8+ */
> #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT (1 << 5) /* !snb */
> #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT (1 << 4)
> #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT (1 << 3)
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 7ae753358a6d..74f563d23cc8 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1106,6 +1106,7 @@ void intel_engines_park(struct drm_i915_private *i915)
> /* Flush the residual irq tasklets first. */
> intel_engine_disarm_breadcrumbs(engine);
> tasklet_kill(&engine->execlists.tasklet);
> + tasklet_kill(&engine->execlists.watchdog_tasklet);
>
> /*
> * We are committed now to parking the engines, make sure there
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 58b6ff8453dc..bc10acb24d9a 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -218,7 +218,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
>
> static void hangcheck_declare_hang(struct drm_i915_private *i915,
> unsigned int hung,
> - unsigned int stuck)
> + unsigned int stuck,
> + unsigned int watchdog)
> {
> struct intel_engine_cs *engine;
> char msg[80];
> @@ -231,13 +232,16 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
> if (stuck != hung)
> hung &= ~stuck;
> len = scnprintf(msg, sizeof(msg),
> - "%s on ", stuck == hung ? "no progress" : "hang");
> + "%s on ", watchdog ? "watchdog timeout" :
> + stuck == hung ? "no progress" : "hang");
> for_each_engine_masked(engine, i915, hung, tmp)
> len += scnprintf(msg + len, sizeof(msg) - len,
> "%s, ", engine->name);
> msg[len-2] = '\0';
>
> - return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
> + return i915_handle_error(i915, hung,
> + watchdog ? 0 : I915_ERROR_CAPTURE,
> + "%s", msg);
> }
>
> /*
> @@ -255,7 +259,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> gpu_error.hangcheck_work.work);
> struct intel_engine_cs *engine;
> enum intel_engine_id id;
> - unsigned int hung = 0, stuck = 0, wedged = 0;
> + unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
>
> if (!i915_modparams.enable_hangcheck)
> return;
> @@ -266,6 +270,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> if (i915_terminally_wedged(&dev_priv->gpu_error))
> return;
>
> + if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
> + watchdog = 1;
> +
> /* As enabling the GPU requires fairly extensive mmio access,
> * periodically arm the mmio checker to see if we are triggering
> * any invalid access.
> @@ -311,7 +318,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
> }
>
> if (hung)
> - hangcheck_declare_hang(dev_priv, hung, stuck);
> + hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
>
> /* Reset timer in case GPU hangs without another request being added */
> i915_queue_hangcheck(dev_priv);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 9ca7dc7a6fa5..c38b239ab39e 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2352,6 +2352,53 @@ static int gen8_emit_flush_render(struct i915_request *request,
> return 0;
> }
>
> +/* From GEN9 onwards, all engines use the same RING_CNTR format */
> +static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
I'd let the compiler decide on the inline or not.
> +{
> + if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
> + return GEN8_WATCHDOG_DISABLE;
> + else
> + return GEN8_XCS_WATCHDOG_DISABLE;
> +}
> +
> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
Not sure macro is useful.
> +static void gen8_watchdog_irq_handler(unsigned long data)
gen8_watchdog_tasklet I guess.
> +{
> + struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> + struct drm_i915_private *dev_priv = engine->i915;
> + unsigned int hung = 0;
> + u32 current_seqno=0;
Coding style.
> + char msg[80];
> + unsigned int tmp;
> + int len;
> +
> + /* Stop the counter to prevent further timeout interrupts */
> + I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
These registers do not need forcewake?
> +
> + /* Read the heartbeat seqno once again to check if we are stuck? */
> + current_seqno = intel_engine_get_hangcheck_seqno(engine);
> +
> + if (current_seqno == engine->current_seqno) {
> + hung |= engine->mask;
> +
> + len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog timeout");
> + for_each_engine_masked(engine, dev_priv, hung, tmp)
> + len += scnprintf(msg + len, sizeof(msg) - len,
> + "%s, ", engine->name);
> + msg[len-2] = '\0';
Copy/paste from intel_hangcheck.c ? Moving to common helper would be good.
> +
> + i915_handle_error(dev_priv, hung, 0, "%s", msg);
> +
> + /* Reset timer in case GPU hangs without another request being added */
> + i915_queue_hangcheck(dev_priv);
Mis-indented block.
> + }else{
Coding style.
> + /* Re-start the counter, if really hung, it will expire again */
> + I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> + GEN8_WATCHDOG_1000US(dev_priv));
> + I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
> + }
> +}
> +
> /*
> * Reserve space for 2 NOOPs at the end of each request to be
> * used as a workaround for not being allowed to do lite
> @@ -2539,6 +2586,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
>
> engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
> engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> +
> + switch (engine->class) {
> + default:
> + /* BCS engine does not support hw watchdog */
> + break;
> + case RENDER_CLASS:
> + case VIDEO_DECODE_CLASS:
> + engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
> + break;
> + case VIDEO_ENHANCEMENT_CLASS:
> + if (INTEL_GEN(engine->i915) >= 9)
> + engine->irq_keep_mask |=
> + GT_GEN8_WATCHDOG_INTERRUPT << shift;
> + break;
> + }
> }
>
> static int
> @@ -2556,6 +2618,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
> tasklet_init(&engine->execlists.tasklet,
> execlists_submission_tasklet, (unsigned long)engine);
>
> + tasklet_init(&engine->execlists.watchdog_tasklet,
> + gen8_watchdog_irq_handler, (unsigned long)engine);
> +
> logical_ring_default_vfuncs(engine);
> logical_ring_default_irqs(engine);
>
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 465094e38d32..17250ba0246f 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -122,6 +122,7 @@ struct intel_engine_hangcheck {
> u64 acthd;
> u32 last_seqno;
> u32 next_seqno;
> + u32 watchdog;
Looks unused.
> unsigned long action_timestamp;
> struct intel_instdone instdone;
> };
> @@ -222,6 +223,11 @@ struct intel_engine_execlists {
> */
> struct tasklet_struct tasklet;
>
> + /**
> + * @watchdog_tasklet: stop counter and re-schedule hangcheck_work asap
> + */
> + struct tasklet_struct watchdog_tasklet;
> +
> /**
> * @default_priolist: priority list for I915_PRIORITY_NORMAL
> */
> @@ -353,6 +359,7 @@ struct intel_engine_cs {
> unsigned int hw_id;
> unsigned int guc_id;
> unsigned long mask;
> + u32 current_seqno;
I don't see where this is set in this patch?
And I'd recommend calling it watchdog_last_seqno or something along
those lines so it is obvious it is not a fundamental part of the engine.
>
> u8 uabi_class;
>
>
Regards,
Tvrtko
More information about the Intel-gfx
mailing list