[Intel-gfx] [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+

Thu Feb 28 17:38:11 UTC 2019

On 21/02/2019 02:58, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry at intel.com>
> 
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with a engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption (e.g. when
> watchdog had been enabled in the low priority batch). The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
> 
> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications elude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.
> 
> v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> against false-positives.
> 
> v3: Don't use high priority tasklet, use engine_last_submit while
> checking for false-positives. From GEN9 onwards, the stop counter bit is
> the same for all engines.
> 
> v4: Remove unnecessary brackets, use current_seqno to mark the request
> as guilty in the hangcheck/capture code.
> 
> v5: Rebased after RESET_ENGINEs flag.
> 
> v6: Don't capture error state in case of watchdog timeout. The capture
> process is time consuming and this will align to what happens when we
> use GuC to handle the watchdog timeout. (Chris)
> 
> v7: Rebase.
> 
> v8: Rebase, use HZ to reschedule.
> 
> v9: Rebase, get forcewake domains in function (no longer in execlists
> struct).
> 
> v10: Rebase.
> 
> v11: Rebase,
>       remove extra braces (Tvrtko),
>       implement watchdog_to_clock_counts helper (Tvrtko),
>       Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
>       Use a global heartbeat seqno instead of engine seqno (Chris)
>       Make all engines checks all class based checks (Tvrtko)
> 
> Cc: Antonio Argenziano <antonio.argenziano at intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> Signed-off-by: Carlos Santa <carlos.santa at intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  8 +++
>   drivers/gpu/drm/i915/i915_gpu_error.h   |  4 ++
>   drivers/gpu/drm/i915/i915_irq.c         | 12 ++++-
>   drivers/gpu/drm/i915/i915_reg.h         |  6 +++
>   drivers/gpu/drm/i915/intel_engine_cs.c  |  1 +
>   drivers/gpu/drm/i915/intel_hangcheck.c  | 17 +++++--
>   drivers/gpu/drm/i915/intel_lrc.c        | 65 +++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++
>   8 files changed, 114 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 63a008aebfcd..0fcb2df869a2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3120,6 +3120,14 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
>   	return ctx;
>   }
>   
> +static inline u32
> +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
> +{
> +	u64 threshold = 0;
> +
> +	return threshold;
> +}
> +
>   int i915_perf_open_ioctl(struct drm_device *dev, void *data,
>   			 struct drm_file *file);
>   int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index f408060e0667..bd1821c73ecd 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -233,6 +233,9 @@ struct i915_gpu_error {
>   	 * i915_mutex_lock_interruptible()?). I915_RESET_BACKOFF serves a
>   	 * secondary role in preventing two concurrent global reset attempts.
>   	 *
> +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
> +	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
> +	 *
>   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
>   	 * acquire the struct_mutex to reset an engine, we need an explicit
>   	 * flag to prevent two concurrent reset attempts in the same engine.
> @@ -248,6 +251,7 @@ struct i915_gpu_error {
>   #define I915_RESET_BACKOFF	0
>   #define I915_RESET_MODESET	1
>   #define I915_RESET_ENGINE	2
> +#define I915_RESET_WATCHDOG	3
>   #define I915_WEDGED		(BITS_PER_LONG - 1)
>   
>   	/** Number of times an engine has been reset */
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 4b23b2fd1fad..e2a1a07b0f2c 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1456,6 +1456,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>   
>   	if (tasklet)
>   		tasklet_hi_schedule(&engine->execlists.tasklet);
> +
> +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
>   }
>   
>   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> @@ -3883,17 +3886,24 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
>   	u32 gt_interrupts[] = {
>   		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
>   		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
>   			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> -			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
> +			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
>   		0,
>   		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
>   		};
>   
> +	/* VECS watchdog is only available in skl+ */
> +	if (INTEL_GEN(dev_priv) >= 9)
> +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> +
>   	dev_priv->pm_ier = 0x0;
>   	dev_priv->pm_imr = ~dev_priv->pm_ier;
>   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 1eca166d95bb..a0e101bbcbce 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2335,6 +2335,11 @@ enum i915_power_well_id {
>   #define RING_START(base)	_MMIO((base) + 0x38)
>   #define RING_CTL(base)		_MMIO((base) + 0x3c)
>   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
> +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> +#define   GEN8_WATCHDOG_ENABLE		0
> +#define   GEN8_WATCHDOG_DISABLE		1
> +#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
> +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
>   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
>   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
>   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> @@ -2894,6 +2899,7 @@ enum i915_power_well_id {
>   #define GT_BSD_USER_INTERRUPT			(1 << 12)
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
>   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
>   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
>   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 7ae753358a6d..74f563d23cc8 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1106,6 +1106,7 @@ void intel_engines_park(struct drm_i915_private *i915)
>   		/* Flush the residual irq tasklets first. */
>   		intel_engine_disarm_breadcrumbs(engine);
>   		tasklet_kill(&engine->execlists.tasklet);
> +		tasklet_kill(&engine->execlists.watchdog_tasklet);
>   
>   		/*
>   		 * We are committed now to parking the engines, make sure there
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 58b6ff8453dc..bc10acb24d9a 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -218,7 +218,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
>   
>   static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   				   unsigned int hung,
> -				   unsigned int stuck)
> +				   unsigned int stuck,
> +				   unsigned int watchdog)
>   {
>   	struct intel_engine_cs *engine;
>   	char msg[80];
> @@ -231,13 +232,16 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   	if (stuck != hung)
>   		hung &= ~stuck;
>   	len = scnprintf(msg, sizeof(msg),
> -			"%s on ", stuck == hung ? "no progress" : "hang");
> +			"%s on ", watchdog ? "watchdog timeout" :
> +				  stuck == hung ? "no progress" : "hang");
>   	for_each_engine_masked(engine, i915, hung, tmp)
>   		len += scnprintf(msg + len, sizeof(msg) - len,
>   				 "%s, ", engine->name);
>   	msg[len-2] = '\0';
>   
> -	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
> +	return i915_handle_error(i915, hung,
> +				 watchdog ? 0 : I915_ERROR_CAPTURE,
> +				 "%s", msg);
>   }
>   
>   /*
> @@ -255,7 +259,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   			     gpu_error.hangcheck_work.work);
>   	struct intel_engine_cs *engine;
>   	enum intel_engine_id id;
> -	unsigned int hung = 0, stuck = 0, wedged = 0;
> +	unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
>   
>   	if (!i915_modparams.enable_hangcheck)
>   		return;
> @@ -266,6 +270,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	if (i915_terminally_wedged(&dev_priv->gpu_error))
>   		return;
>   
> +	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
> +		watchdog = 1;
> +
>   	/* As enabling the GPU requires fairly extensive mmio access,
>   	 * periodically arm the mmio checker to see if we are triggering
>   	 * any invalid access.
> @@ -311,7 +318,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	}
>   
>   	if (hung)
> -		hangcheck_declare_hang(dev_priv, hung, stuck);
> +		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
>   
>   	/* Reset timer in case GPU hangs without another request being added */
>   	i915_queue_hangcheck(dev_priv);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 9ca7dc7a6fa5..c38b239ab39e 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2352,6 +2352,53 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   	return 0;
>   }
>   
> +/* From GEN9 onwards, all engines use the same RING_CNTR format */
> +static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)

I'd let the compiler decide on the inline or not.

> +{
> +	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
> +		return GEN8_WATCHDOG_DISABLE;
> +	else
> +		return GEN8_XCS_WATCHDOG_DISABLE;
> +}
> +
> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)

Not sure macro is useful.

> +static void gen8_watchdog_irq_handler(unsigned long data)

gen8_watchdog_tasklet I guess.

> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	unsigned int hung = 0;
> +	u32 current_seqno=0;

Coding style.

> +	char msg[80];
> +	unsigned int tmp;
> +	int len;
> +
> +	/* Stop the counter to prevent further timeout interrupts */
> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));

These registers do not need forcewake?

> +
> +	/* Read the heartbeat seqno once again to check if we are stuck? */
> +	current_seqno = intel_engine_get_hangcheck_seqno(engine);
> +
> +    if (current_seqno == engine->current_seqno) {
> +		hung |= engine->mask;
> +
> +		len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog timeout");
> +		for_each_engine_masked(engine, dev_priv, hung, tmp)
> +			len += scnprintf(msg + len, sizeof(msg) - len,
> +					 "%s, ", engine->name);
> +		msg[len-2] = '\0';

Copy/paste from intel_hangcheck.c ? Moving to common helper would be good.

> +
> +		i915_handle_error(dev_priv, hung, 0, "%s", msg);
> +
> +		/* Reset timer in case GPU hangs without another request being added */
> +		i915_queue_hangcheck(dev_priv);

Mis-indented block.

> +    }else{

Coding style.

> +		/* Re-start the counter, if really hung, it will expire again */
> +		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> +			      GEN8_WATCHDOG_1000US(dev_priv));
> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
> +    }
> +}
> +
>   /*
>    * Reserve space for 2 NOOPs at the end of each request to be
>    * used as a workaround for not being allowed to do lite
> @@ -2539,6 +2586,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
>   
>   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
>   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> +
> +	switch (engine->class) {
> +	default:
> +		/* BCS engine does not support hw watchdog */
> +		break;
> +	case RENDER_CLASS:
> +	case VIDEO_DECODE_CLASS:
> +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		if (INTEL_GEN(engine->i915) >= 9)
> +			engine->irq_keep_mask |=
> +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	}
>   }
>   
>   static int
> @@ -2556,6 +2618,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
>   	tasklet_init(&engine->execlists.tasklet,
>   		     execlists_submission_tasklet, (unsigned long)engine);
>   
> +	tasklet_init(&engine->execlists.watchdog_tasklet,
> +		     gen8_watchdog_irq_handler, (unsigned long)engine);
> +
>   	logical_ring_default_vfuncs(engine);
>   	logical_ring_default_irqs(engine);
>   
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 465094e38d32..17250ba0246f 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -122,6 +122,7 @@ struct intel_engine_hangcheck {
>   	u64 acthd;
>   	u32 last_seqno;
>   	u32 next_seqno;
> +	u32 watchdog;

Looks unused.

>   	unsigned long action_timestamp;
>   	struct intel_instdone instdone;
>   };
> @@ -222,6 +223,11 @@ struct intel_engine_execlists {
>   	 */
>   	struct tasklet_struct tasklet;
>   
> +	/**
> +	 * @watchdog_tasklet: stop counter and re-schedule hangcheck_work asap
> +	 */
> +	struct tasklet_struct watchdog_tasklet;
> +
>   	/**
>   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
>   	 */
> @@ -353,6 +359,7 @@ struct intel_engine_cs {
>   	unsigned int hw_id;
>   	unsigned int guc_id;
>   	unsigned long mask;
> +	u32 current_seqno;

I don't see where this is set in this patch?

And I'd recommend calling it watchdog_last_seqno or something along 
those lines so it is obvious it is not a fundamental part of the engine.

>   
>   	u8 uabi_class;
>   
> 

Regards,

Tvrtko