[Intel-gfx] [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery

Mon Dec 19 09:51:56 UTC 2016

On Sun, Dec 18, 2016 at 09:02:33PM -0800, Michel Thierry wrote:
> 
> 
> On 12/16/2016 12:45 PM, Chris Wilson wrote:
> >On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
> >>From: Arun Siluvery <arun.siluvery at linux.intel.com>
> >>
> >>This change implements support for per-engine reset as an initial, less
> >>intrusive hang recovery option to be attempted before falling back to the
> >>legacy full GPU reset recovery mode if necessary. This is only supported
> >>from Gen8 onwards.
> >>
> >>Hangchecker determines which engines are hung and invokes error handler to
> >>recover from it. Error handler schedules recovery for each of those engines
> >>that are hung. The recovery procedure is as follows,
> >> - identifies the request that caused the hang and it is dropped
> >> - force engine to idle: this is done by issuing a reset request
> >> - reset and re-init engine
> >> - restart submissions to the engine
> >>
> >>If engine reset fails then we fall back to heavy weight full gpu reset
> >>which resets all engines and reinitiazes complete state of HW and SW.
> >>
> >>v2: Rebase.
> >>
> >>Cc: Chris Wilson <chris at chris-wilson.co.uk>
> >>Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> >>Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> >>Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
> >>Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> >>---
> >> drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
> >> drivers/gpu/drm/i915/i915_drv.h     |  3 ++
> >> drivers/gpu/drm/i915/i915_gem.c     |  2 +-
> >> drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
> >> drivers/gpu/drm/i915/intel_lrc.h    |  1 +
> >> drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
> >> 6 files changed, 108 insertions(+), 7 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >>index e5688edd62cd..a034793bc246 100644
> >>--- a/drivers/gpu/drm/i915/i915_drv.c
> >>+++ b/drivers/gpu/drm/i915/i915_drv.c
> >>@@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
> >>  *
> >>  * Reset a specific GPU engine. Useful if a hang is detected.
> >>  * Returns zero on successful reset or otherwise an error code.
> >>+ *
> >>+ * Procedure is fairly simple:
> >>+ *  - identifies the request that caused the hang and it is dropped
> >>+ *  - force engine to idle: this is done by issuing a reset request
> >>+ *  - reset engine
> >>+ *  - restart submissions to the engine
> >>  */
> >> int i915_reset_engine(struct intel_engine_cs *engine)
> >
> >What's the serialisation between potential callers of
> >i915_reset_engine()?
> >
> 
> I haven't seen simultaneous calls happening yet, would a
> reset_engine-specific mutex be enough?

That is what it more or less boils down to. But first, do we ensure we
don't declare a second hang on this engine before the first reset is
complete? Then we only a barrier between a single engine reset and a
full reset.

> >> {
> >> 	int ret;
> >> 	struct drm_i915_private *dev_priv = engine->i915;
> >>
> >>-	/* FIXME: replace me with engine reset sequence */
> >>-	ret = -ENODEV;
> >>+	/*
> >>+	 * We need to first idle the engine by issuing a reset request,
> >>+	 * then perform soft reset and re-initialize hw state, for all of
> >>+	 * this GT power need to be awake so ensure it does throughout the
> >>+	 * process
> >>+	 */
> >>+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> >>+
> >>+	/*
> >>+	 * the request that caused the hang is stuck on elsp, identify the
> >>+	 * active request and drop it, adjust head to skip the offending
> >>+	 * request to resume executing remaining requests in the queue.
> >>+	 */
> >>+	i915_gem_reset_engine(engine);
> >
> >Must freeze the engine and irqs first, before calling
> >i915_gem_reset_engine() (i.e. something like disable_engines_irq,
> >cancelling tasklet)
> >
> Will do.
> 
> >Eeek note that the current i915_gem_reset_engine() is lacking a
> >spinlock.
> >
> 
> The new mutex (for i915_reset_engine) should cover this.

No. We need to protect these lists from concurrent manipulation in the
fences. The spinlock protection there needs to be extended here, being
covered by struct_mutex is no longer sufficient protection for the
current code.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre