[Intel-gfx] [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery

Mon Dec 19 14:42:15 UTC 2016

On Mon, Dec 19, 2016 at 04:24:18PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris at chris-wilson.co.uk> writes:
> 
> > On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
> >> From: Arun Siluvery <arun.siluvery at linux.intel.com>
> >> 
> >> This change implements support for per-engine reset as an initial, less
> >> intrusive hang recovery option to be attempted before falling back to the
> >> legacy full GPU reset recovery mode if necessary. This is only supported
> >> from Gen8 onwards.
> >> 
> >> The hangchecker determines which engines are hung and invokes the error
> >> handler to recover from the hang. The error handler schedules recovery for
> >> each hung engine. The recovery procedure is as follows:
> >>  - identify the request that caused the hang and drop it
> >>  - force engine to idle: this is done by issuing a reset request
> >>  - reset and re-init engine
> >>  - restart submissions to the engine
> >> 
> >> If the engine reset fails, we fall back to the heavyweight full GPU reset,
> >> which resets all engines and reinitializes the complete HW and SW state.
> >> 
> >> v2: Rebase.
> >> 
> >> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> >> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> >> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> >> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
> >> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
> >>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
> >>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
> >>  drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
> >>  drivers/gpu/drm/i915/intel_lrc.h    |  1 +
> >>  drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
> >>  6 files changed, 108 insertions(+), 7 deletions(-)
> >> 
> >> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >> index e5688edd62cd..a034793bc246 100644
> >> --- a/drivers/gpu/drm/i915/i915_drv.c
> >> +++ b/drivers/gpu/drm/i915/i915_drv.c
> >> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
> >>   *
> >>   * Reset a specific GPU engine. Useful if a hang is detected.
> >>   * Returns zero on successful reset or otherwise an error code.
> >> + *
> >> + * Procedure is fairly simple:
> >> + *  - identifies the request that caused the hang and it is dropped
> >> + *  - force engine to idle: this is done by issuing a reset request
> >> + *  - reset engine
> >> + *  - restart submissions to the engine
> >>   */
> >>  int i915_reset_engine(struct intel_engine_cs *engine)
> >
> > What's the serialisation between potential callers of
> > i915_reset_engine()?
> >
> 
> I feel that making 'reset_in_progress' a per-engine feature would
> clarify this. It would also be more fitting, since it is now a single
> engine that can be in reset at a particular point in time, decoupled
> from the others.

That is not what "reset_in_progress" means. It principally means
MUTEX_BACKOFF.
-Chris
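
For illustration only, the serialisation question can be made concrete
with a minimal userspace sketch. It assumes a per-engine "resetting" flag
as the thing that keeps two callers from resetting the same engine at
once, and it falls back to a full reset when the engine reset fails, as
the commit message describes. Every name here (toy_engine,
toy_reset_engine, toy_full_gpu_reset, ...) is hypothetical. This is not
the i915 code, and it is not what "reset_in_progress" means in the
driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an engine; NOT struct intel_engine_cs. */
struct toy_engine {
	const char *name;
	atomic_flag resetting;    /* set while one caller owns the reset */
	bool hw_reset_ok;         /* simulates engine reset success/failure */
	unsigned int reset_count; /* successful engine-only resets */
};

enum toy_reset_result {
	TOY_RESET_ENGINE_OK, /* engine-only reset succeeded */
	TOY_RESET_FULL_GPU,  /* engine reset failed, full reset was used */
	TOY_RESET_BUSY,      /* another caller is already resetting it */
};

/* Counts how often the (simulated) heavyweight path was taken. */
static unsigned int toy_full_gpu_resets;

static void toy_full_gpu_reset(void)
{
	toy_full_gpu_resets++;
}

/* Simulated engine-level reset; fails when hw_reset_ok is false. */
static bool toy_engine_hw_reset(struct toy_engine *e)
{
	if (!e->hw_reset_ok)
		return false;
	e->reset_count++;
	return true;
}

enum toy_reset_result toy_reset_engine(struct toy_engine *e)
{
	enum toy_reset_result res = TOY_RESET_ENGINE_OK;

	assert(e != NULL);

	/* Only the caller that flips the flag from clear to set proceeds. */
	if (atomic_flag_test_and_set(&e->resetting))
		return TOY_RESET_BUSY;

	/* Try the light-weight path first, then fall back. */
	if (!toy_engine_hw_reset(e)) {
		toy_full_gpu_reset();
		res = TOY_RESET_FULL_GPU;
	}

	atomic_flag_clear(&e->resetting);
	return res;
}
```

A second caller that arrives while the flag is set gets TOY_RESET_BUSY
instead of racing with the first. Whether such a caller should wait,
retry, or escalate is a policy decision this sketch leaves open.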

-- 
Chris Wilson, Intel Open Source Technology Centre