[Intel-gfx] [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+

Fri Mar 1 01:51:17 UTC 2019

On Thu, 2019-02-28 at 17:38 +0000, Tvrtko Ursulin wrote:
> On 21/02/2019 02:58, Carlos Santa wrote:
> > From: Michel Thierry <michel.thierry at intel.com>
> > 
> > *** General ***
> > 
> > Watchdog timeout (or "media engine reset") is a feature that allows
> > userland applications to enable hang detection on individual batch
> > buffers.
> > The detection mechanism itself is mostly bound to the hardware and
> > the only
> > thing that the driver needs to do to support this form of hang
> > detection
> > is to implement the interrupt handling support as well as watchdog
> > command
> > emission before and after the emitted batch buffer start
> > instruction in the
> > ring buffer.
> > 
> > The principle of the hang detection mechanism is as follows:
> > 
> > 1. Once the decision has been made to enable watchdog timeout for a
> > particular batch buffer and the driver is in the process of
> > emitting the
> > batch buffer start instruction into the ring buffer it also emits a
> > watchdog timer start instruction before and a watchdog timer
> > cancellation
> > instruction after the batch buffer start instruction in the ring
> > buffer.
> > 
> > 2. Once the GPU execution reaches the watchdog timer start
> > instruction
> > the hardware watchdog counter is started by the hardware. The
> > counter
> > keeps counting until either reaching a previously configured
> > threshold
> > value or the timer cancellation instruction is executed.
> > 
> > 2a. If the counter reaches the threshold value the hardware fires a
> > watchdog interrupt that is picked up by the watchdog interrupt
> > handler.
> > This means that a hang has been detected and the driver needs to
> > deal with
> > it the same way it would deal with a engine hang detected by the
> > periodic
> > hang checker. The only difference between the two is that we
> > already blamed
> > the active request (to ensure an engine reset).
> > 
> > 2b. If the batch buffer completes and the execution reaches the
> > watchdog
> > cancellation instruction before the watchdog counter reaches its
> > threshold value the watchdog is cancelled and nothing more comes of
> > it.
> > No hang is detected.
> > 
> > Note about future interaction with preemption: Preemption could
> > happen
> > in a command sequence prior to watchdog counter getting disabled,
> > resulting in watchdog being triggered following preemption (e.g.
> > when
> > watchdog had been enabled in the low priority batch). The driver
> > will
> > need to explicitly disable the watchdog counter as part of the
> > preemption sequence.
> > 
> > *** This patch introduces: ***
> > 
> > 1. IRQ handler code for watchdog timeout allowing direct hang
> > recovery
> > based on hardware-driven hang detection, which then integrates
> > directly
> > with the hang recovery path. This is independent of having per-
> > engine reset
> > or just full gpu reset.
> > 
> > 2. Watchdog specific register information.
> > 
> > Currently the render engine and all available media engines support
> > watchdog timeout (VECS is only supported in GEN9). The
> > specifications elude
> > to the BCS engine being supported but that is currently not
> > supported by
> > this commit.
> > 
> > Note that the value to stop the counter is different between render
> > and
> > non-render engines in GEN8; GEN9 onwards it's the same.
> > 
> > v2: Move irq handler to tasklet, arm watchdog for a 2nd time to
> > check
> > against false-positives.
> > 
> > v3: Don't use high priority tasklet, use engine_last_submit while
> > checking for false-positives. From GEN9 onwards, the stop counter
> > bit is
> > the same for all engines.
> > 
> > v4: Remove unnecessary brackets, use current_seqno to mark the
> > request
> > as guilty in the hangcheck/capture code.
> > 
> > v5: Rebased after RESET_ENGINEs flag.
> > 
> > v6: Don't capture error state in case of watchdog timeout. The
> > capture
> > process is time consuming and this will align to what happens when
> > we
> > use GuC to handle the watchdog timeout. (Chris)
> > 
> > v7: Rebase.
> > 
> > v8: Rebase, use HZ to reschedule.
> > 
> > v9: Rebase, get forcewake domains in function (no longer in
> > execlists
> > struct).
> > 
> > v10: Rebase.
> > 
> > v11: Rebase,
> >       remove extra braces (Tvrtko),
> >       implement watchdog_to_clock_counts helper (Tvrtko),
> >       Move tasklet_kill(watchdog_tasklet) inside intel_engines
> > (Tvrtko),
> >       Use a global heartbeat seqno instead of engine seqno (Chris)
> >       Make all engines checks all class based checks (Tvrtko)
> > 
> > Cc: Antonio Argenziano <antonio.argenziano at intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
> > Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa at intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_drv.h         |  8 +++
> >   drivers/gpu/drm/i915/i915_gpu_error.h   |  4 ++
> >   drivers/gpu/drm/i915/i915_irq.c         | 12 ++++-
> >   drivers/gpu/drm/i915/i915_reg.h         |  6 +++
> >   drivers/gpu/drm/i915/intel_engine_cs.c  |  1 +
> >   drivers/gpu/drm/i915/intel_hangcheck.c  | 17 +++++--
> >   drivers/gpu/drm/i915/intel_lrc.c        | 65
> > +++++++++++++++++++++++++
> >   drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++
> >   8 files changed, 114 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h
> > b/drivers/gpu/drm/i915/i915_drv.h
> > index 63a008aebfcd..0fcb2df869a2 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -3120,6 +3120,14 @@ i915_gem_context_lookup(struct
> > drm_i915_file_private *file_priv, u32 id)
> >   	return ctx;
> >   }
> >   
> > +static inline u32
> > +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64
> > value_in_us)
> > +{
> > +	u64 threshold = 0;
> > +
> > +	return threshold;
> > +}
> > +
> >   int i915_perf_open_ioctl(struct drm_device *dev, void *data,
> >   			 struct drm_file *file);
> >   int i915_perf_add_config_ioctl(struct drm_device *dev, void
> > *data,
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h
> > b/drivers/gpu/drm/i915/i915_gpu_error.h
> > index f408060e0667..bd1821c73ecd 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> > @@ -233,6 +233,9 @@ struct i915_gpu_error {
> >   	 * i915_mutex_lock_interruptible()?). I915_RESET_BACKOFF serves
> > a
> >   	 * secondary role in preventing two concurrent global reset
> > attempts.
> >   	 *
> > +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we
> > can use
> > +	 * I915_RESET_WATCHDOG to report the hang detection cause
> > accurately.
> > +	 *
> >   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't
> > need to
> >   	 * acquire the struct_mutex to reset an engine, we need an
> > explicit
> >   	 * flag to prevent two concurrent reset attempts in the same
> > engine.
> > @@ -248,6 +251,7 @@ struct i915_gpu_error {
> >   #define I915_RESET_BACKOFF	0
> >   #define I915_RESET_MODESET	1
> >   #define I915_RESET_ENGINE	2
> > +#define I915_RESET_WATCHDOG	3
> >   #define I915_WEDGED		(BITS_PER_LONG - 1)
> >   
> >   	/** Number of times an engine has been reset */
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c
> > b/drivers/gpu/drm/i915/i915_irq.c
> > index 4b23b2fd1fad..e2a1a07b0f2c 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1456,6 +1456,9 @@ gen8_cs_irq_handler(struct intel_engine_cs
> > *engine, u32 iir)
> >   
> >   	if (tasklet)
> >   		tasklet_hi_schedule(&engine->execlists.tasklet);
> > +
> > +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> > +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
> >   }
> >   
> >   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -3883,17 +3886,24 @@ static void gen8_gt_irq_postinstall(struct
> > drm_i915_private *dev_priv)
> >   	u32 gt_interrupts[] = {
> >   		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_RCS_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_RCS_IRQ_SHIFT |
> >   			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT
> > |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_BCS_IRQ_SHIFT,
> >   		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS1_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_VCS1_IRQ_SHIFT |
> >   			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT
> > |
> > -			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT,
> > +			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT,
> >   		0,
> >   		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VECS_IRQ_SHIFT
> >   		};
> >   
> > +	/* VECS watchdog is only available in skl+ */
> > +	if (INTEL_GEN(dev_priv) >= 9)
> > +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> > +
> >   	dev_priv->pm_ier = 0x0;
> >   	dev_priv->pm_imr = ~dev_priv->pm_ier;
> >   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> > diff --git a/drivers/gpu/drm/i915/i915_reg.h
> > b/drivers/gpu/drm/i915/i915_reg.h
> > index 1eca166d95bb..a0e101bbcbce 100644
> > --- a/drivers/gpu/drm/i915/i915_reg.h
> > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > @@ -2335,6 +2335,11 @@ enum i915_power_well_id {
> >   #define RING_START(base)	_MMIO((base) + 0x38)
> >   #define RING_CTL(base)		_MMIO((base) + 0x3c)
> >   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in
> > bytes -> pages */
> > +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> > +#define   GEN8_WATCHDOG_ENABLE		0
> > +#define   GEN8_WATCHDOG_DISABLE		1
> > +#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 &
> > non-render only */
> > +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
> >   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
> >   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
> >   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> > @@ -2894,6 +2899,7 @@ enum i915_power_well_id {
> >   #define GT_BSD_USER_INTERRUPT			(1 << 12)
> >   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /*
> > hsw+; rsvd on snb, ivb, vlv */
> >   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> > +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
> >   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /*
> > !snb */
> >   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
> >   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c
> > b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 7ae753358a6d..74f563d23cc8 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1106,6 +1106,7 @@ void intel_engines_park(struct
> > drm_i915_private *i915)
> >   		/* Flush the residual irq tasklets first. */
> >   		intel_engine_disarm_breadcrumbs(engine);
> >   		tasklet_kill(&engine->execlists.tasklet);
> > +		tasklet_kill(&engine->execlists.watchdog_tasklet);
> >   
> >   		/*
> >   		 * We are committed now to parking the engines, make
> > sure there
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c
> > b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index 58b6ff8453dc..bc10acb24d9a 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -218,7 +218,8 @@ static void hangcheck_accumulate_sample(struct
> > intel_engine_cs *engine,
> >   
> >   static void hangcheck_declare_hang(struct drm_i915_private *i915,
> >   				   unsigned int hung,
> > -				   unsigned int stuck)
> > +				   unsigned int stuck,
> > +				   unsigned int watchdog)
> >   {
> >   	struct intel_engine_cs *engine;
> >   	char msg[80];
> > @@ -231,13 +232,16 @@ static void hangcheck_declare_hang(struct
> > drm_i915_private *i915,
> >   	if (stuck != hung)
> >   		hung &= ~stuck;
> >   	len = scnprintf(msg, sizeof(msg),
> > -			"%s on ", stuck == hung ? "no progress" :
> > "hang");
> > +			"%s on ", watchdog ? "watchdog timeout" :
> > +				  stuck == hung ? "no progress" :
> > "hang");
> >   	for_each_engine_masked(engine, i915, hung, tmp)
> >   		len += scnprintf(msg + len, sizeof(msg) - len,
> >   				 "%s, ", engine->name);
> >   	msg[len-2] = '\0';
> >   
> > -	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s",
> > msg);
> > +	return i915_handle_error(i915, hung,
> > +				 watchdog ? 0 : I915_ERROR_CAPTURE,
> > +				 "%s", msg);
> >   }
> >   
> >   /*
> > @@ -255,7 +259,7 @@ static void i915_hangcheck_elapsed(struct
> > work_struct *work)
> >   			     gpu_error.hangcheck_work.work);
> >   	struct intel_engine_cs *engine;
> >   	enum intel_engine_id id;
> > -	unsigned int hung = 0, stuck = 0, wedged = 0;
> > +	unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
> >   
> >   	if (!i915_modparams.enable_hangcheck)
> >   		return;
> > @@ -266,6 +270,9 @@ static void i915_hangcheck_elapsed(struct
> > work_struct *work)
> >   	if (i915_terminally_wedged(&dev_priv->gpu_error))
> >   		return;
> >   
> > +	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv-
> > >gpu_error.flags))
> > +		watchdog = 1;
> > +
> >   	/* As enabling the GPU requires fairly extensive mmio access,
> >   	 * periodically arm the mmio checker to see if we are
> > triggering
> >   	 * any invalid access.
> > @@ -311,7 +318,7 @@ static void i915_hangcheck_elapsed(struct
> > work_struct *work)
> >   	}
> >   
> >   	if (hung)
> > -		hangcheck_declare_hang(dev_priv, hung, stuck);
> > +		hangcheck_declare_hang(dev_priv, hung, stuck,
> > watchdog);
> >   
> >   	/* Reset timer in case GPU hangs without another request being
> > added */
> >   	i915_queue_hangcheck(dev_priv);
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > b/drivers/gpu/drm/i915/intel_lrc.c
> > index 9ca7dc7a6fa5..c38b239ab39e 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -2352,6 +2352,53 @@ static int gen8_emit_flush_render(struct
> > i915_request *request,
> >   	return 0;
> >   }
> >   
> > +/* From GEN9 onwards, all engines use the same RING_CNTR format */
> > +static inline u32 get_watchdog_disable(struct intel_engine_cs
> > *engine)
> 
> I'd let the compiler decide on the inline or not.
> 
> > +{
> > +	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
> > +		return GEN8_WATCHDOG_DISABLE;
> > +	else
> > +		return GEN8_XCS_WATCHDOG_DISABLE;
> > +}
> > +
> > +#define GEN8_WATCHDOG_1000US(dev_priv)
> > watchdog_to_clock_counts(dev_priv, 1000)
> 
> Not sure macro is useful.
> 
> > +static void gen8_watchdog_irq_handler(unsigned long data)
> 
> gen8_watchdog_tasklet I guess.
> 
> > +{
> > +	struct intel_engine_cs *engine = (struct intel_engine_cs
> > *)data;
> > +	struct drm_i915_private *dev_priv = engine->i915;
> > +	unsigned int hung = 0;
> > +	u32 current_seqno=0;
> 
> Coding style.
> 
> > +	char msg[80];
> > +	unsigned int tmp;
> > +	int len;
> > +
> > +	/* Stop the counter to prevent further timeout interrupts */
> > +	I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > get_watchdog_disable(engine));
> 
> These registers do not need forcewake?
> 
> > +
> > +	/* Read the heartbeat seqno once again to check if we are
> > stuck? */
> > +	current_seqno = intel_engine_get_hangcheck_seqno(engine);
> > +
> > +    if (current_seqno == engine->current_seqno) {
> > +		hung |= engine->mask;
> > +
> > +		len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog
> > timeout");
> > +		for_each_engine_masked(engine, dev_priv, hung, tmp)
> > +			len += scnprintf(msg + len, sizeof(msg) - len,
> > +					 "%s, ", engine->name);
> > +		msg[len-2] = '\0';
> 
> Copy/paste from intel_hangcheck.c ? Moving to common helper would be
> good.
> 
> > +
> > +		i915_handle_error(dev_priv, hung, 0, "%s", msg);
> > +
> > +		/* Reset timer in case GPU hangs without another
> > request being added */
> > +		i915_queue_hangcheck(dev_priv);
> 
> Mis-indented block.
> 
> > +    }else{
> 
> Coding style.
> 
> > +		/* Re-start the counter, if really hung, it will expire
> > again */
> > +		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> > +			      GEN8_WATCHDOG_1000US(dev_priv));
> > +		I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > GEN8_WATCHDOG_ENABLE);
> > +    }
> > +}
> > +
> >   /*
> >    * Reserve space for 2 NOOPs at the end of each request to be
> >    * used as a workaround for not being allowed to do lite
> > @@ -2539,6 +2586,21 @@ logical_ring_default_irqs(struct
> > intel_engine_cs *engine)
> >   
> >   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
> >   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> > +
> > +	switch (engine->class) {
> > +	default:
> > +		/* BCS engine does not support hw watchdog */
> > +		break;
> > +	case RENDER_CLASS:
> > +	case VIDEO_DECODE_CLASS:
> > +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT <<
> > shift;
> > +		break;
> > +	case VIDEO_ENHANCEMENT_CLASS:
> > +		if (INTEL_GEN(engine->i915) >= 9)
> > +			engine->irq_keep_mask |=
> > +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> > +		break;
> > +	}
> >   }
> >   
> >   static int
> > @@ -2556,6 +2618,9 @@ logical_ring_setup(struct intel_engine_cs
> > *engine)
> >   	tasklet_init(&engine->execlists.tasklet,
> >   		     execlists_submission_tasklet, (unsigned
> > long)engine);
> >   
> > +	tasklet_init(&engine->execlists.watchdog_tasklet,
> > +		     gen8_watchdog_irq_handler, (unsigned long)engine);
> > +
> >   	logical_ring_default_vfuncs(engine);
> >   	logical_ring_default_irqs(engine);
> >   
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > index 465094e38d32..17250ba0246f 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > @@ -122,6 +122,7 @@ struct intel_engine_hangcheck {
> >   	u64 acthd;
> >   	u32 last_seqno;
> >   	u32 next_seqno;
> > +	u32 watchdog;
> 
> Looks unused.
> 
> >   	unsigned long action_timestamp;
> >   	struct intel_instdone instdone;
> >   };
> > @@ -222,6 +223,11 @@ struct intel_engine_execlists {
> >   	 */
> >   	struct tasklet_struct tasklet;
> >   
> > +	/**
> > +	 * @watchdog_tasklet: stop counter and re-schedule
> > hangcheck_work asap
> > +	 */
> > +	struct tasklet_struct watchdog_tasklet;
> > +
> >   	/**
> >   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
> >   	 */
> > @@ -353,6 +359,7 @@ struct intel_engine_cs {
> >   	unsigned int hw_id;
> >   	unsigned int guc_id;
> >   	unsigned long mask;
> > +	u32 current_seqno;
> 
> I don't see where this is set in this patch?

It was declared here but assigned on patch #3 of the series, right
after the watchdog timer is started. The idea was to store the seqno we
are currently working (before we hang) and then cros check once again
right before we reset inside the irq for the watchdog. 

/* Read the heartbeat seqno once again to check if we are stuck? */
current_seqno = intel_engine_get_hangcheck_seqno(engine);

if (current_seqno == engine->current_seqno) {

> 
> And I'd recommend calling it watchdog_last_seqno or something along 
> those lines so it is obvious it is not a fundamental part of the
> engine.
> 
> >   
> >   	u8 uabi_class;
> >   
> > 
> 
> Regards,
> 
> Tvrtko