[Intel-gfx] [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress
Chris Wilson
chris at chris-wilson.co.uk
Thu Jun 14 18:39:23 UTC 2018
Quoting Mika Kuoppala (2018-06-14 16:06:39)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
>
> > Hangcheck is our back up in case the GPU or the driver gets stuck. It
> > detects when the GPU is not making any progress and issues a GPU reset.
> > However, if the driver is failing to make any progress, we can get
> > ourselves into a situation where we continually try resetting the GPU to
> > no avail. Employ a second timeout such that if we continue to see the
> > same seqno (the stalled engine has made no progress at all) over the
> > course of several hangchecks, declare the driver wedged and attempt to
> > start afresh.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > ---
> > drivers/gpu/drm/i915/i915_debugfs.c | 5 +++--
> > drivers/gpu/drm/i915/i915_drv.h | 2 ++
> > drivers/gpu/drm/i915/intel_hangcheck.c | 17 ++++++++++++++++-
> > drivers/gpu/drm/i915/intel_ringbuffer.h | 3 ++-
> > 4 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index 82d06a03b22f..7e68684be1e4 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
> > seq_printf(m, " seqno = %x [current %x, last %x]\n",
> > engine->hangcheck.seqno, seqno[id],
> > intel_engine_last_submit(engine));
> > - seq_printf(m, " waiters? %s, fake irq active? %s, stalled? %s\n",
> > + seq_printf(m, " waiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
> > yesno(intel_engine_has_waiter(engine)),
> > yesno(test_bit(engine->id,
> > &dev_priv->gpu_error.missed_irq_rings)),
> > - yesno(engine->hangcheck.stalled));
> > + yesno(engine->hangcheck.stalled),
> > + yesno(engine->hangcheck.wedged));
> >
> > spin_lock_irq(&b->rb_lock);
> > for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 6649962e991a..d254a29a59a0 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -909,6 +909,8 @@ struct i915_gem_mm {
> > #define I915_ENGINE_DEAD_TIMEOUT (4 * HZ) /* Seqno, head and subunits dead */
> > #define I915_SEQNO_DEAD_TIMEOUT (12 * HZ) /* Seqno dead with active head */
> >
> > +#define I915_ENGINE_WEDGED_TIMEOUT (60 * HZ) /* Reset but no recovery? */
> > +
> > enum modeset_restore {
> > MODESET_ON_LID_OPEN,
> > MODESET_DONE,
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index d47e346bd49e..2fc7a0dd0df9 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
> > engine->hangcheck.seqno = hc->seqno;
> > engine->hangcheck.action = hc->action;
> > engine->hangcheck.stalled = hc->stalled;
> > + engine->hangcheck.wedged = hc->wedged;
> > }
> >
> > static enum intel_engine_hangcheck_action
> > @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
> >
> > hc->stalled = time_after(jiffies,
> > engine->hangcheck.action_timestamp + timeout);
> > + hc->wedged = time_after(jiffies,
> > + engine->hangcheck.action_timestamp +
> > + I915_ENGINE_WEDGED_TIMEOUT);
>
> I was concerned that some callpath does end up zeroing
> hangcheck->seqno through intel_engine_init_hangcheck.
>
> It seems that seqno wrap and unparking are the
> only ones, with exception of setup paths that
> does this.
>
> But as both wrap and unpark are done with idle
> precondition, this should only at worst bring
> one extra tick to reload the seqnos.
>
> So I can't poke holes in this and it should
> prevent our driver mistakes ending up in
> endless resets loops, like the tin says.
>
> Reviewed-by: Mika Kuoppala <mika.kuoppala at linux.intel.com>
Thanks, it's just meant to be a safety net against our (driver) bugs, so
I don't mind it being coarse. Just loud and abrasive.
Pushed,
-Chris
More information about the Intel-gfx
mailing list