[Intel-gfx] [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
Michel Thierry
michel.thierry at intel.com
Fri Apr 21 00:17:05 UTC 2017
On 19/04/17 03:49, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>> - identifies the request that caused the hang and it is dropped
>> - force engine to idle: this is done by issuing a reset request
>> - reset and re-init engine
>> - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> v2: Rebase.
>> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
>> calling i915_gem_reset_engine (Chris).
>> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
>> reuse the function for reset_engine.
>> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
>> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
>> revoke/restore_fences and _retire_requests (Chris).
>>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_drv.c | 76 ++++++++++++++++++++++++--
>> drivers/gpu/drm/i915/i915_drv.h | 12 +++-
>> drivers/gpu/drm/i915/i915_gem.c | 97 +++++++++++++++++++--------------
>> drivers/gpu/drm/i915/i915_gem_request.c | 2 +-
>> drivers/gpu/drm/i915/intel_uncore.c | 20 +++++++
>> 5 files changed, 158 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index e03d0643dbd6..634893cd93b3 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>
>> pr_notice("drm/i915: Resetting chip after gpu hang\n");
>> disable_irq(dev_priv->drm.irq);
>> - ret = i915_gem_reset_prepare(dev_priv);
>> + ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
>> if (ret) {
>> DRM_ERROR("GPU recovery failed\n");
>> intel_gpu_reset(dev_priv, ALL_ENGINES);
>> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>> i915_queue_hangcheck(dev_priv);
>>
>> finish:
>> - i915_gem_reset_finish(dev_priv);
>> + i915_gem_reset_finish(dev_priv, ALL_ENGINES);
>> enable_irq(dev_priv->drm.irq);
>>
>> wakeup:
>> @@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
>> *
>> * Reset a specific GPU engine. Useful if a hang is detected.
>> * Returns zero on successful reset or otherwise an error code.
>> + *
>> + * Procedure is:
>> + * - identifies the request that caused the hang and it is dropped
>> + * - force engine to idle: this is done by issuing a reset request
>> + * - reset engine
>> + * - restart submissions to the engine
>> */
>> int i915_reset_engine(struct intel_engine_cs *engine)
>> {
>> - /* FIXME: replace me with engine reset sequence */
>> - return -ENODEV;
>> + int ret;
>> + struct drm_i915_private *dev_priv = engine->i915;
>> + struct i915_gpu_error *error = &dev_priv->gpu_error;
>> +
>> + GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
>> +
>> + DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
>> +
>> + /*
>> + * We need to first idle the engine by issuing a reset request,
>> + * then perform soft reset and re-initialize hw state, for all of
>> + * this GT power need to be awake so ensure it does throughout the
>> + * process
>> + */
>> + intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>
> Hmm, what path did we take to get here without taking rpm? I thought I
> had fixed the last offender.
>
Too many rebases... As you say, this is no longer needed after
1604a86d08053 "drm/i915: Extend rpm wakelock during i915_handle_error()"
>> + disable_irq(dev_priv->drm.irq);
>
> I am 99% certain that we don't need to disable_irq() now for per-engine
> reset... I'd keep it in the global reset as simple paranoia.
>
100% correct.
>> + ret = i915_gem_reset_prepare_engine(engine);
>> + if (ret) {
>> + DRM_ERROR("Previous reset failed - promote to full reset\n");
>> + goto error;
>> + }
>> +
>> + /*
>> + * the request that caused the hang is stuck on elsp, identify the
>> + * active request and drop it, adjust head to skip the offending
>> + * request to resume executing remaining requests in the queue.
>> + */
>
> Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
> struct_mutex) to skip replaying innocent requests, but here we should be
> asserting that we do have the hung request.
>
> i.e.
> request = i915_gem_find_active_request(engine);
> if (!request)
> goto skip.
>
> Bonus points for tying that into i915_gem_reset_prepare_engine() so that
> we only seach for the active_request once.
>
What about something like this?
int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
...
if (engine_stalled(engine)) {
request = i915_gem_find_active_request(engine);
+ if (!request)
+ err = -ECANCELED;
+
if (request && request->fence.error == -EIO)
err = -EIO; /* Previous reset failed! */
}
and
int i915_reset_engine(struct intel_engine_cs *engine)
...
DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
ret = i915_gem_reset_prepare_engine(engine);
- if (ret) {
+ if (ret == -ECANCELED) {
+ DRM_INFO("no active request found, skip reset\n");
+ goto skip;
+ } else if (ret) {
DRM_ERROR("Previous reset failed - promote to full
reset\n");
goto out;
...
+skip:
+ i915_gem_reset_finish_engine(engine);
+ goto out;
...
I'll still have to keep the request (if found) and pass it to
i915_gem_reset_engine, to avoid the extra find_active_request.
>> + i915_gem_reset_engine(engine);
>
> Where does skip_request() get called?
>
Sorry, I didn't get this, skip_request happens under this path:
i915_gem_reset_engine
--> i915_gem_reset_request
-----> if (guilty)
>> +
>> + /* forcing engine to idle */
>> + ret = intel_reset_engine_start(engine);
>> + if (ret) {
>> + DRM_ERROR("Failed to disable %s\n", engine->name);
>> + goto error;
>> + }
>> +
>> + /* finally, reset engine */
>> + ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> + if (ret) {
>> + DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> + intel_reset_engine_cancel(engine);
>> + goto error;
>> + }
>> +
>> + /* be sure the request reset bit gets cleared */
>> + intel_reset_engine_cancel(engine);
>> +
>> + i915_gem_reset_finish_engine(engine);
>> +
>> + /* replay remaining requests in the queue */
>> + ret = engine->init_hw(engine);
>> + if (ret)
>> + goto error;
>> +
>> +wakeup:
>> + enable_irq(dev_priv->drm.irq);
>> + intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>> + wake_up_bit(&error->flags, I915_RESET_HANDOFF);
>
> No handoff here anymore.
>
All these are leftovers from the previous version.
>> + return ret;
>> +
>> +error:
>> + /* use full gpu reset to recover on error */
>> + goto wakeup;
>> }
>
More information about the Intel-gfx
mailing list