[Intel-gfx] [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+

Mon Mar 25 10:00:55 UTC 2019

On 22/03/2019 23:41, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry at intel.com>
> 
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with a engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption (e.g. when
> watchdog had been enabled in the low priority batch). The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
> 
> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications elude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.
> 
> v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> against false-positives.
> 
> v3: Don't use high priority tasklet, use engine_last_submit while
> checking for false-positives. From GEN9 onwards, the stop counter bit is
> the same for all engines.
> 
> v4: Remove unnecessary brackets, use current_seqno to mark the request
> as guilty in the hangcheck/capture code.
> 
> v5: Rebased after RESET_ENGINEs flag.
> 
> v6: Don't capture error state in case of watchdog timeout. The capture
> process is time consuming and this will align to what happens when we
> use GuC to handle the watchdog timeout. (Chris)
> 
> v7: Rebase.
> 
> v8: Rebase, use HZ to reschedule.
> 
> v9: Rebase, get forcewake domains in function (no longer in execlists
> struct).
> 
> v10: Rebase.
> 
> v11: Rebase,
>       remove extra braces (Tvrtko),
>       implement watchdog_to_clock_counts helper (Tvrtko),
>       Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
>       Use a global heartbeat seqno instead of engine seqno (Chris)
>       Make all engines checks all class based checks (Tvrtko)
> 
> v12: Rebase,
>       Reset immediately upon entering the IRQ (Chris)
>       Make reset_engine_to_str a helper (Tvrtko)
>       Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko)
>       Let the compiler itself do the inline (Tvrtko)
> 
> v13: Rebase
> v14: Rebase, skip checking for the guilty seqno in the tasklet (Tvrtko)

IIRC I only asked about it so I guess you tried it and it works well?

Can you also post the IGTs so we can see the test coverage?

Regards,

Tvrtko

> 
> Cc: Antonio Argenziano <antonio.argenziano at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> Signed-off-by: Carlos Santa <carlos.santa at intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.h     |  4 ++
>   drivers/gpu/drm/i915/i915_irq.c           | 14 ++++--
>   drivers/gpu/drm/i915/i915_reg.h           |  6 +++
>   drivers/gpu/drm/i915/i915_reset.c         | 20 +++++++++
>   drivers/gpu/drm/i915/i915_reset.h         |  6 +++
>   drivers/gpu/drm/i915/intel_engine_cs.c    |  1 +
>   drivers/gpu/drm/i915/intel_engine_types.h |  5 +++
>   drivers/gpu/drm/i915/intel_hangcheck.c    | 11 +----
>   drivers/gpu/drm/i915/intel_lrc.c          | 52 +++++++++++++++++++++++
>   9 files changed, 107 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index 99d6b7b270c2..6cf6a8679b26 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -203,6 +203,9 @@ struct i915_gpu_error {
>   	 * any global resources that may be clobber by the reset (such as
>   	 * FENCE registers).
>   	 *
> +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
> +	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
> +	 *
>   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
>   	 * acquire the struct_mutex to reset an engine, we need an explicit
>   	 * flag to prevent two concurrent reset attempts in the same engine.
> @@ -218,6 +221,7 @@ struct i915_gpu_error {
>   #define I915_RESET_BACKOFF	0
>   #define I915_RESET_MODESET	1
>   #define I915_RESET_ENGINE	2
> +#define I915_RESET_WATCHDOG	3
>   #define I915_WEDGED		(BITS_PER_LONG - 1)
>   
>   	/** Number of times the device has been reset (global) */
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 82d487189a34..e64994be25c3 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1466,6 +1466,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>   
>   	if (tasklet)
>   		tasklet_hi_schedule(&engine->execlists.tasklet);
> +
> +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
>   }
>   
>   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> @@ -3892,20 +3895,25 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
>   	u32 gt_interrupts[] = {
>   		(GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   		 GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT),
> -
>   		(GT_RENDER_USER_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
>   		 GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> -		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
> -
> +		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
>   		0,
>   
>   		(GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT)
>   	};
>   
> +	/* VECS watchdog is only available in skl+ */
> +	if (INTEL_GEN(dev_priv) >= 9)
> +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> +
>   	dev_priv->pm_ier = 0x0;
>   	dev_priv->pm_imr = ~dev_priv->pm_ier;
>   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 9b69cec21f7b..ac8d984e16ae 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2363,6 +2363,11 @@ enum i915_power_well_id {
>   #define RING_START(base)	_MMIO((base) + 0x38)
>   #define RING_CTL(base)		_MMIO((base) + 0x3c)
>   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
> +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> +#define GEN8_WATCHDOG_ENABLE		0
> +#define GEN8_WATCHDOG_DISABLE		1
> +#define GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
> +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
>   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
>   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
>   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> @@ -2925,6 +2930,7 @@ enum i915_power_well_id {
>   #define GT_BSD_USER_INTERRUPT			(1 << 12)
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
>   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
>   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
>   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 861fe083e383..739fa5ad1a8d 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -1208,6 +1208,26 @@ void i915_clear_error_registers(struct drm_i915_private *dev_priv)
>   	}
>   }
>   
> +void engine_reset_error_to_str(struct drm_i915_private *i915,
> +	           char *msg,
> +	           size_t sz,
> +	           unsigned int hung,
> +	           unsigned int stuck,
> +	           unsigned int watchdog)
> +{
> +	int len;
> +	unsigned int tmp;
> +	struct intel_engine_cs *engine;
> +
> +	len = scnprintf(msg, sz,
> +			"%s on ", watchdog ? "watchdog timeout" :
> +				stuck == hung ? "no_progress" : "hang");
> +	for_each_engine_masked(engine, i915, hung, tmp)
> +		len += scnprintf(msg + len, sz - len,
> +				"%s, ", engine->name);
> +	msg[len-2] = '\0';
> +}
> +
>   /**
>    * i915_handle_error - handle a gpu error
>    * @i915: i915 device private
> diff --git a/drivers/gpu/drm/i915/i915_reset.h b/drivers/gpu/drm/i915/i915_reset.h
> index 16f2389f656f..8582d1242248 100644
> --- a/drivers/gpu/drm/i915/i915_reset.h
> +++ b/drivers/gpu/drm/i915/i915_reset.h
> @@ -20,6 +20,12 @@ void i915_handle_error(struct drm_i915_private *i915,
>   		       u32 engine_mask,
>   		       unsigned long flags,
>   		       const char *fmt, ...);
> +void engine_reset_error_to_str(struct drm_i915_private *i915,
> +               char *str,
> +               size_t sz,
> +               unsigned int hung,
> +               unsigned int stuck,
> +               unsigned int watchdog);
>   #define I915_ERROR_CAPTURE BIT(0)
>   
>   void i915_clear_error_registers(struct drm_i915_private *i915);
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 652c1b3ba190..88cf0fc07623 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1149,6 +1149,7 @@ void intel_engines_park(struct drm_i915_private *i915)
>   		/* Flush the residual irq tasklets first. */
>   		intel_engine_disarm_breadcrumbs(engine);
>   		tasklet_kill(&engine->execlists.tasklet);
> +		tasklet_kill(&engine->execlists.watchdog_tasklet);
>   
>   		/*
>   		 * We are committed now to parking the engines, make sure there
> diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
> index b0aa1f0d4e47..c4f66b774e7c 100644
> --- a/drivers/gpu/drm/i915/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> @@ -124,6 +124,11 @@ struct intel_engine_execlists {
>   	 */
>   	struct tasklet_struct tasklet;
>   
> +	/*
> +	 * @watchdog_tasklet: stop counter and reschedule hangcheck_work asap
> +	 */
> +	struct tasklet_struct watchdog_tasklet;
> +
>   	/**
>   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
>   	 */
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 57ed49dc19c4..4bf26863678c 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -220,22 +220,15 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   				   unsigned int hung,
>   				   unsigned int stuck)
>   {
> -	struct intel_engine_cs *engine;
>   	char msg[80];
> -	unsigned int tmp;
> -	int len;
> +	size_t len = sizeof(msg);
>   
>   	/* If some rings hung but others were still busy, only
>   	 * blame the hanging rings in the synopsis.
>   	 */
>   	if (stuck != hung)
>   		hung &= ~stuck;
> -	len = scnprintf(msg, sizeof(msg),
> -			"%s on ", stuck == hung ? "no progress" : "hang");
> -	for_each_engine_masked(engine, i915, hung, tmp)
> -		len += scnprintf(msg + len, sizeof(msg) - len,
> -				 "%s, ", engine->name);
> -	msg[len-2] = '\0';
> +	engine_reset_error_to_str(i915, msg, len, hung, stuck, 0);
>   
>   	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
>   }
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index e54e0064b2d6..85785a94f6ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2195,6 +2195,40 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   	return 0;
>   }
>   
> +static void gen8_watchdog_tasklet(unsigned long data)
> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	enum forcewake_domains fw_domains;
> +	char msg[80];
> +	size_t len = sizeof(msg);
> +	unsigned long *lock = &engine->i915->gpu_error.flags;
> +	unsigned int bit = I915_RESET_ENGINE + engine->id;
> +
> +	switch (engine->class) {
> +	default:
> +		MISSING_CASE(engine->id);
> +		/* fall through */
> +	case RENDER_CLASS:
> +		fw_domains = FORCEWAKE_RENDER;
> +		break;
> +	case VIDEO_DECODE_CLASS:
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		fw_domains = FORCEWAKE_MEDIA;
> +		break;
> +	}
> +
> +	intel_uncore_forcewake_get(dev_priv, fw_domains);
> +
> +	if (!test_and_set_bit(bit, lock)) {
> +		unsigned int hung = engine->mask;
> +		engine_reset_error_to_str(dev_priv, msg, len, hung, 0, 1);
> +		i915_reset_engine(engine, msg);
> +		clear_bit(bit, lock);
> +		wake_up_bit(lock, bit);
> +	}
> +}
> +
>   /*
>    * Reserve space for 2 NOOPs at the end of each request to be
>    * used as a workaround for not being allowed to do lite
> @@ -2377,6 +2411,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
>   
>   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
>   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> +
> +	switch (engine->class) {
> +	default:
> +		/* BCS engine does not support hw watchdog */
> +		break;
> +	case RENDER_CLASS:
> +	case VIDEO_DECODE_CLASS:
> +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		if (INTEL_GEN(engine->i915) >= 9)
> +			engine->irq_keep_mask |=
> +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	}
>   }
>   
>   static int
> @@ -2394,6 +2443,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
>   	tasklet_init(&engine->execlists.tasklet,
>   		     execlists_submission_tasklet, (unsigned long)engine);
>   
> +	tasklet_init(&engine->execlists.watchdog_tasklet,
> +		     gen8_watchdog_tasklet, (unsigned long)engine);
> +
>   	logical_ring_default_vfuncs(engine);
>   	logical_ring_default_irqs(engine);
>   
>