[Intel-gfx] [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
Chris Wilson
chris at chris-wilson.co.uk
Fri Dec 16 20:45:21 UTC 2016
On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
>
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
> - identifies the request that caused the hang and it is dropped
> - force engine to idle: this is done by issuing a reset request
> - reset and re-init engine
> - restart submissions to the engine
>
> If engine reset fails then we fall back to heavy weight full gpu reset
> which resets all engines and reinitializes complete state of HW and SW.
>
> v2: Rebase.
>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.c | 56 +++++++++++++++++++++++++++++++++++--
> drivers/gpu/drm/i915/i915_drv.h | 3 ++
> drivers/gpu/drm/i915/i915_gem.c | 2 +-
> drivers/gpu/drm/i915/intel_lrc.c | 12 ++++++++
> drivers/gpu/drm/i915/intel_lrc.h | 1 +
> drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
> 6 files changed, 108 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e5688edd62cd..a034793bc246 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
> *
> * Reset a specific GPU engine. Useful if a hang is detected.
> * Returns zero on successful reset or otherwise an error code.
> + *
> + * Procedure is fairly simple:
> + * - identifies the request that caused the hang and it is dropped
> + * - force engine to idle: this is done by issuing a reset request
> + * - reset engine
> + * - restart submissions to the engine
> */
> int i915_reset_engine(struct intel_engine_cs *engine)
What serialises potential callers of i915_reset_engine() against each
other?
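
To make the question concrete, here is a minimal userspace sketch of one
way two callers could be kept from resetting the same engine at once: a
per-engine "reset in progress" flag claimed with an atomic test-and-set.
Every name in it (engine_model, engine_reset_trylock, engine_reset) is
made up for illustration and is not i915 code; a kernel version would
presumably use a bit operation on an existing flags word rather than C11
atomics.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an engine; not the i915 structure. */
struct engine_model {
	const char *name;
	atomic_flag reset_in_progress;	/* set while one caller owns the reset */
	int resets_done;
};

/* Returns true if this caller claimed the reset, false if another holds it. */
static bool engine_reset_trylock(struct engine_model *e)
{
	return !atomic_flag_test_and_set_explicit(&e->reset_in_progress,
						  memory_order_acquire);
}

static void engine_reset_unlock(struct engine_model *e)
{
	atomic_flag_clear_explicit(&e->reset_in_progress,
				   memory_order_release);
}

/* Only the caller that wins the flag performs the reset; others back off. */
static int engine_reset(struct engine_model *e)
{
	if (!engine_reset_trylock(e))
		return -EBUSY;

	e->resets_done++;	/* placeholder for the real reset sequence */

	engine_reset_unlock(e);
	return 0;
}
```

Whether a losing caller should back off with an error, as here, or wait
for the winner to finish is a separate design decision that the sketch
does not try to settle.
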
> {
> int ret;
> struct drm_i915_private *dev_priv = engine->i915;
>
> - /* FIXME: replace me with engine reset sequence */
> - ret = -ENODEV;
> + /*
> + * We need to first idle the engine by issuing a reset request,
> + * then perform soft reset and re-initialize hw state, for all of
> + * this GT power need to be awake so ensure it does throughout the
> + * process
> + */
> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +
> + /*
> + * the request that caused the hang is stuck on elsp, identify the
> + * active request and drop it, adjust head to skip the offending
> + * request to resume executing remaining requests in the queue.
> + */
> + i915_gem_reset_engine(engine);
We must freeze the engine and its irqs before calling
i915_gem_reset_engine() (e.g. something like disable_engines_irq and
cancelling the tasklet).
Eeek, note that the current i915_gem_reset_engine() is lacking a
spinlock.
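
Roughly the ordering I have in mind, as a userspace model rather than
driver code: quiesce the interrupt source and the tasklet first, and only
then touch the request queue, under a lock. The struct and the model_*
helpers are hypothetical names for illustration; the atomic_flag spin
loop merely stands in for a kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical engine state; not the i915 structure. */
struct engine_model {
	bool irq_enabled;	/* engine interrupt may still fire */
	bool tasklet_pending;	/* submission tasklet may still run */
	atomic_flag queue_lock;	/* stands in for a spinlock on the queue */
	int head;		/* index of the next request to execute */
	int hung_request;	/* index of the request blamed for the hang */
};

static void model_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void model_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Skip the hung request; only legal once the engine has been frozen. */
static void model_skip_hung_request(struct engine_model *e)
{
	assert(!e->irq_enabled && !e->tasklet_pending);

	model_lock(&e->queue_lock);
	if (e->head == e->hung_request)
		e->head++;
	model_unlock(&e->queue_lock);
}

/* Freeze first, then modify the queue. */
static void model_freeze_and_skip(struct engine_model *e)
{
	e->irq_enabled = false;		/* cf. disable_engines_irq */
	e->tasklet_pending = false;	/* cf. cancelling the tasklet */
	model_skip_hung_request(e);
}
```
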
> +
> + ret = intel_engine_reset_begin(engine);
> + if (ret) {
> + DRM_ERROR("Failed to disable %s\n", engine->name);
> + goto error;
> + }
> +
> + ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> + if (ret) {
> + DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> + intel_engine_reset_cancel(engine);
> + goto error;
> + }
> +
> + ret = engine->init_hw(engine);
> + if (ret)
> + goto error;
>
> + intel_engine_reset_cancel(engine);
> + intel_execlists_restart_submission(engine);
engine->init_hw(engine) *is* intel_execlists_restart_submission(), so
this second call is redundant.
> +
> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> + return 0;
> +
> +error:
> /* use full gpu reset to recover on error */
> set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
>
> + /* Engine reset is performed without taking struct_mutex, since it
> + * failed we now fallback to full gpu reset. Wakeup any waiters
> + * which should now see the reset_in_progress and release
> + * struct_mutex for us to continue recovery.
> + */
> + rcu_read_lock();
> + intel_engine_wakeup(engine);
> + rcu_read_unlock();
> +
> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> return ret;
> }
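
For reference, the control flow of the function above can be modelled in
a few lines: take the wake reference, run the steps in order, and on any
failure flag a full reset before dropping the reference again. This is
only a sketch under my own assumptions; gpu_model, the step enum and the
fail_at fault-injection field are invented for illustration, and the
single exit label is a simplification of the two exit paths in the
quoted patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Steps of the per-engine reset that may fail in this model. */
enum reset_step { STEP_NONE, STEP_BEGIN, STEP_RESET, STEP_INIT };

/* Hypothetical device state; not the i915 structure. */
struct gpu_model {
	int forcewake_refs;		/* must be balanced on every path */
	bool full_reset_requested;	/* fallback flag set on failure */
	enum reset_step fail_at;	/* fault injection for the model */
};

static int model_step(const struct gpu_model *gpu, enum reset_step step)
{
	return gpu->fail_at == step ? -EIO : 0;
}

static int model_reset_engine(struct gpu_model *gpu)
{
	int ret;

	gpu->forcewake_refs++;		/* keep the GT awake throughout */

	ret = model_step(gpu, STEP_BEGIN);	/* request engine reset */
	if (ret)
		goto out;

	ret = model_step(gpu, STEP_RESET);	/* reset the engine */
	if (ret)
		goto out;

	ret = model_step(gpu, STEP_INIT);	/* re-init and restart */

out:
	if (ret)
		gpu->full_reset_requested = true; /* fall back to full reset */

	gpu->forcewake_refs--;		/* released on success and failure */
	return ret;
}
```
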
--
Chris Wilson, Intel Open Source Technology Centre