[Intel-gfx] [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress

Thu Jun 14 15:06:39 UTC 2018

Chris Wilson <chris at chris-wilson.co.uk> writes:

> Hangcheck is our back up in case the GPU or the driver gets stuck. It
> detects when the GPU is not making any progress and issues a GPU reset.
> However, if the driver is failing to make any progress, we can get
> ourselves into a situation where we continually try resetting the GPU to
> no avail. Employ a second timeout such that if we continue to see the
> same seqno (the stalled engine has made no progress at all) over the
> course of several hangchecks, declare the driver wedged and attempt to
> start afresh.
>
> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
>  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
>  drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
>  4 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 82d06a03b22f..7e68684be1e4 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  		seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
>  			   engine->hangcheck.seqno, seqno[id],
>  			   intel_engine_last_submit(engine));
> -		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s\n",
> +		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
>  			   yesno(intel_engine_has_waiter(engine)),
>  			   yesno(test_bit(engine->id,
>  					  &dev_priv->gpu_error.missed_irq_rings)),
> -			   yesno(engine->hangcheck.stalled));
> +			   yesno(engine->hangcheck.stalled),
> +			   yesno(engine->hangcheck.wedged));
>  
>  		spin_lock_irq(&b->rb_lock);
>  		for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 6649962e991a..d254a29a59a0 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -909,6 +909,8 @@ struct i915_gem_mm {
>  #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
>  #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
>  
> +#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
> +
>  enum modeset_restore {
>  	MODESET_ON_LID_OPEN,
>  	MODESET_DONE,
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index d47e346bd49e..2fc7a0dd0df9 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
>  	engine->hangcheck.seqno = hc->seqno;
>  	engine->hangcheck.action = hc->action;
>  	engine->hangcheck.stalled = hc->stalled;
> +	engine->hangcheck.wedged = hc->wedged;
>  }
>  
>  static enum intel_engine_hangcheck_action
> @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
>  
>  	hc->stalled = time_after(jiffies,
>  				 engine->hangcheck.action_timestamp + timeout);
> +	hc->wedged = time_after(jiffies,
> +				 engine->hangcheck.action_timestamp +
> +				 I915_ENGINE_WEDGED_TIMEOUT);

I was concerned that some callpath does end up zeroing
hangcheck->seqno through intel_engine_init_hangcheck.

It seems that seqno wrap and unparking are the
only ones, with exception of setup paths that
does this.

But as both wrap and unpark are done with idle
precondition, this should only at worst bring
one extra tick to reload the seqnos.

So I can't poke holes in this and it should
prevent our driver mistakes ending up in
endless resets loops, like the tin says.

Reviewed-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>