[Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
Chris Wilson
chris at chris-wilson.co.uk
Wed Apr 19 10:49:26 UTC 2017
On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
>
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
> - identifies the request that caused the hang and it is dropped
> - force engine to idle: this is done by issuing a reset request
> - reset and re-init engine
> - restart submissions to the engine
>
> If engine reset fails then we fall back to heavy weight full gpu reset
> which resets all engines and reinitiazes complete state of HW and SW.
>
> v2: Rebase.
> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
> calling i915_gem_reset_engine (Chris).
> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
> reuse the function for reset_engine.
> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
> revoke/restore_fences and _retire_requests (Chris).
>
> Cc: Chris Wilson <chris at chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
> ---
> drivers/gpu/drm/i915/i915_drv.c | 76 ++++++++++++++++++++++++--
> drivers/gpu/drm/i915/i915_drv.h | 12 +++-
> drivers/gpu/drm/i915/i915_gem.c | 97 +++++++++++++++++++--------------
> drivers/gpu/drm/i915/i915_gem_request.c | 2 +-
> drivers/gpu/drm/i915/intel_uncore.c | 20 +++++++
> 5 files changed, 158 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e03d0643dbd6..634893cd93b3 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>
> pr_notice("drm/i915: Resetting chip after gpu hang\n");
> disable_irq(dev_priv->drm.irq);
> - ret = i915_gem_reset_prepare(dev_priv);
> + ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
> if (ret) {
> DRM_ERROR("GPU recovery failed\n");
> intel_gpu_reset(dev_priv, ALL_ENGINES);
> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
> i915_queue_hangcheck(dev_priv);
>
> finish:
> - i915_gem_reset_finish(dev_priv);
> + i915_gem_reset_finish(dev_priv, ALL_ENGINES);
> enable_irq(dev_priv->drm.irq);
>
> wakeup:
> @@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
> *
> * Reset a specific GPU engine. Useful if a hang is detected.
> * Returns zero on successful reset or otherwise an error code.
> + *
> + * Procedure is:
> + * - identifies the request that caused the hang and it is dropped
> + * - force engine to idle: this is done by issuing a reset request
> + * - reset engine
> + * - restart submissions to the engine
> */
> int i915_reset_engine(struct intel_engine_cs *engine)
> {
> - /* FIXME: replace me with engine reset sequence */
> - return -ENODEV;
> + int ret;
> + struct drm_i915_private *dev_priv = engine->i915;
> + struct i915_gpu_error *error = &dev_priv->gpu_error;
> +
> + GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
> +
> + DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
> +
> + /*
> + * We need to first idle the engine by issuing a reset request,
> + * then perform soft reset and re-initialize hw state, for all of
> + * this GT power need to be awake so ensure it does throughout the
> + * process
> + */
> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
Hmm, what path did we take to get here without taking rpm? I thought I
had fixed the last offender.
> + disable_irq(dev_priv->drm.irq);
I am 99% certain that we don't need to disable_irq() now for per-engine
reset... I'd keep it in the global reset as simple paranoia.
> + ret = i915_gem_reset_prepare_engine(engine);
> + if (ret) {
> + DRM_ERROR("Previous reset failed - promote to full reset\n");
> + goto error;
> + }
> +
> + /*
> + * the request that caused the hang is stuck on elsp, identify the
> + * active request and drop it, adjust head to skip the offending
> + * request to resume executing remaining requests in the queue.
> + */
Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
struct_mutex) to skip replaying innocent requests, but here we should be
asserting that we do have the hung request.
i.e.
request = i915_gem_find_active_request(engine);
if (!request)
goto skip.
Bonus points for tying that into i915_gem_reset_prepare_engine() so that
we only seach for the active_request once.
> + i915_gem_reset_engine(engine);
Where does skip_request() get called?
> +
> + /* forcing engine to idle */
> + ret = intel_reset_engine_start(engine);
> + if (ret) {
> + DRM_ERROR("Failed to disable %s\n", engine->name);
> + goto error;
> + }
> +
> + /* finally, reset engine */
> + ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> + if (ret) {
> + DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> + intel_reset_engine_cancel(engine);
> + goto error;
> + }
> +
> + /* be sure the request reset bit gets cleared */
> + intel_reset_engine_cancel(engine);
> +
> + i915_gem_reset_finish_engine(engine);
> +
> + /* replay remaining requests in the queue */
> + ret = engine->init_hw(engine);
> + if (ret)
> + goto error;
> +
> +wakeup:
> + enable_irq(dev_priv->drm.irq);
> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> + wake_up_bit(&error->flags, I915_RESET_HANDOFF);
No handoff here anymore.
> + return ret;
> +
> +error:
> + /* use full gpu reset to recover on error */
> + goto wakeup;
> }
--
Chris Wilson, Intel Open Source Technology Centre
More information about the Intel-gfx
mailing list