[Intel-gfx] [PATCH 3/3] drm/i915: Declare the driver wedged if hangcheck makes no progress

Tue Jun 5 09:33:45 UTC 2018

Quoting Mika Kuoppala (2018-06-05 10:26:37)
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > Hangcheck is our back up in case the GPU or the driver gets stuck. It
> > detects when the GPU is not making any progress and issues a GPU reset.
> > However, if the driver is failing to make any progress, we can get
> > ourselves into a situation where we continually try resetting the GPU to
> > no avail. Employ a second timeout such that if we continue to see the
> > same seqno (the stalled engine has made no progress at all) over the
> > course of several hangchecks, declare the driver wedged and attempt to
> > start afresh.
> >
> > Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala at intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
> >  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
> >  drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
> >  4 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index 15e86d34a81c..2eae3abb0ec1 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -1359,11 +1359,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
> >               seq_printf(m, "    seqno = %x [current %x, last %x]\n",
> >                          engine->hangcheck.seqno, seqno[id],
> >                          intel_engine_last_submit(engine));
> > -             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s\n",
> > +             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
> >                          yesno(intel_engine_has_waiter(engine)),
> >                          yesno(test_bit(engine->id,
> >                                         &dev_priv->gpu_error.missed_irq_rings)),
> > -                        yesno(engine->hangcheck.stalled));
> > +                        yesno(engine->hangcheck.stalled),
> > +                        yesno(engine->hangcheck.wedged));
> >  
> >               spin_lock_irq(&b->rb_lock);
> >               for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 38157df6ff5c..a4ed0baeb0ed 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -994,6 +994,8 @@ struct i915_gem_mm {
> >  #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
> >  #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
> >  
> > +#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
> > +
> >  enum modeset_restore {
> >       MODESET_ON_LID_OPEN,
> >       MODESET_DONE,
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index d47e346bd49e..2fc7a0dd0df9 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
> >       engine->hangcheck.seqno = hc->seqno;
> >       engine->hangcheck.action = hc->action;
> >       engine->hangcheck.stalled = hc->stalled;
> > +     engine->hangcheck.wedged = hc->wedged;
> >  }
> >  
> >  static enum intel_engine_hangcheck_action
> > @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
> >  
> >       hc->stalled = time_after(jiffies,
> >                                engine->hangcheck.action_timestamp + timeout);
> > +     hc->wedged = time_after(jiffies,
> > +                              engine->hangcheck.action_timestamp +
> > +                              I915_ENGINE_WEDGED_TIMEOUT);
> 
> As the init_hangcheck() will clear the wedged state,
> our reset failure paths needs to steer clear of this.

They do by definition (since the reset failed :)
 
> Would it be better if we warn on the wedged being set
> in init paths and explicitly clear it on unset_wedged?

I don't think so, if we reset, we may as well regard the reset as
completed. We detect repeated attempts to do the same reset; it's just
we need a back in case we don't do any reset.
-Chris