[Intel-gfx] [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery

Mon Dec 19 05:02:33 UTC 2016

On 12/16/2016 12:45 PM, Chris Wilson wrote:
> On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>  - identifies the request that caused the hang and it is dropped
>>  - force engine to idle: this is done by issuing a reset request
>>  - reset and re-init engine
>>  - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> v2: Rebase.
>>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
>>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>  drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
>>  drivers/gpu/drm/i915/intel_lrc.h    |  1 +
>>  drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
>>  6 files changed, 108 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index e5688edd62cd..a034793bc246 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>   *
>>   * Reset a specific GPU engine. Useful if a hang is detected.
>>   * Returns zero on successful reset or otherwise an error code.
>> + *
>> + * Procedure is fairly simple:
>> + *  - identifies the request that caused the hang and it is dropped
>> + *  - force engine to idle: this is done by issuing a reset request
>> + *  - reset engine
>> + *  - restart submissions to the engine
>>   */
>>  int i915_reset_engine(struct intel_engine_cs *engine)
>
> What's the serialisation between potential callers of
> i915_reset_engine()?
>

I haven't seen simultaneous calls happening yet, would a 
reset_engine-specific mutex be enough?

>>  {
>>  	int ret;
>>  	struct drm_i915_private *dev_priv = engine->i915;
>>
>> -	/* FIXME: replace me with engine reset sequence */
>> -	ret = -ENODEV;
>> +	/*
>> +	 * We need to first idle the engine by issuing a reset request,
>> +	 * then perform soft reset and re-initialize hw state, for all of
>> +	 * this GT power need to be awake so ensure it does throughout the
>> +	 * process
>> +	 */
>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> +	/*
>> +	 * the request that caused the hang is stuck on elsp, identify the
>> +	 * active request and drop it, adjust head to skip the offending
>> +	 * request to resume executing remaining requests in the queue.
>> +	 */
>> +	i915_gem_reset_engine(engine);
>
> Must freeze the engine and irqs first, before calling
> i915_gem_reset_engine() (i.e. something like disable_engines_irq,
> cancelling tasklet)
>
Will do.

> Eeek note that the current i915_gem_reset_engine() is lacking a
> spinlock.
>

The new mutex (for i915_reset_engine) should cover this.

>> +
>> +	ret = intel_engine_reset_begin(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto error;
>> +	}
>> +
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		intel_engine_reset_cancel(engine);
>> +		goto error;
>> +	}
>> +
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto error;
>>
>> +	intel_engine_reset_cancel(engine);
>> +	intel_execlists_restart_submission(engine);
>
> engine->init_hw(engine) *is* intel_execlists_restart_submission.
>

I'll make these changes,
Thanks.

>> +
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>> +	return 0;
>> +
>> +error:
>>  	/* use full gpu reset to recover on error */
>>  	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
>>
>> +	/* Engine reset is performed without taking struct_mutex, since it
>> +	 * failed we now fallback to full gpu reset. Wakeup any waiters
>> +	 * which should now see the reset_in_progress and release
>> +	 * struct_mutex for us to continue recovery.
>> +	 */
>> +	rcu_read_lock();
>> +	intel_engine_wakeup(engine);
>> +	rcu_read_unlock();
>> +
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>  	return ret;
>>  }
>