[Intel-gfx] [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery

Thu Jan 5 23:55:08 UTC 2017

On 19/12/16 01:51, Chris Wilson wrote:
> On Sun, Dec 18, 2016 at 09:02:33PM -0800, Michel Thierry wrote:
>>
>>
>> On 12/16/2016 12:45 PM, Chris Wilson wrote:
>>> On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
>>>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>>>
>>>> This change implements support for per-engine reset as an initial, less
>>>> intrusive hang recovery option to be attempted before falling back to the
>>>> legacy full GPU reset recovery mode if necessary. This is only supported
>>> >from Gen8 onwards.
>>>>
>>>> Hangchecker determines which engines are hung and invokes error handler to
>>>> recover from it. Error handler schedules recovery for each of those engines
>>>> that are hung. The recovery procedure is as follows,
>>>> - identifies the request that caused the hang and it is dropped
>>>> - force engine to idle: this is done by issuing a reset request
>>>> - reset and re-init engine
>>>> - restart submissions to the engine
>>>>
>>>> If engine reset fails then we fall back to heavy weight full gpu reset
>>>> which resets all engines and reinitiazes complete state of HW and SW.
>>>>
>>>> v2: Rebase.
>>>>
>>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>>>> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
>>>> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
>>>> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
>>>> drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>>> drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>>> drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
>>>> drivers/gpu/drm/i915/intel_lrc.h    |  1 +
>>>> drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
>>>> 6 files changed, 108 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>>>> index e5688edd62cd..a034793bc246 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.c
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>>>> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>>>  *
>>>>  * Reset a specific GPU engine. Useful if a hang is detected.
>>>>  * Returns zero on successful reset or otherwise an error code.
>>>> + *
>>>> + * Procedure is fairly simple:
>>>> + *  - identifies the request that caused the hang and it is dropped
>>>> + *  - force engine to idle: this is done by issuing a reset request
>>>> + *  - reset engine
>>>> + *  - restart submissions to the engine
>>>>  */
>>>> int i915_reset_engine(struct intel_engine_cs *engine)
>>>
>>> What's the serialisation between potential callers of
>>> i915_reset_engine()?
>>>
>>
>> I haven't seen simultaneous calls happening yet, would a
>> reset_engine-specific mutex be enough?
>
> That is what it more or less boils down to. But first, do we ensure we
> don't declare a second hang on this engine before the first reset is
> complete? Then we only a barrier between a single engine reset and a
> full reset.
>
>>>> {
>>>> 	int ret;
>>>> 	struct drm_i915_private *dev_priv = engine->i915;
>>>>
>>>> -	/* FIXME: replace me with engine reset sequence */
>>>> -	ret = -ENODEV;
>>>> +	/*
>>>> +	 * We need to first idle the engine by issuing a reset request,
>>>> +	 * then perform soft reset and re-initialize hw state, for all of
>>>> +	 * this GT power need to be awake so ensure it does throughout the
>>>> +	 * process
>>>> +	 */
>>>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>>>> +
>>>> +	/*
>>>> +	 * the request that caused the hang is stuck on elsp, identify the
>>>> +	 * active request and drop it, adjust head to skip the offending
>>>> +	 * request to resume executing remaining requests in the queue.
>>>> +	 */
>>>> +	i915_gem_reset_engine(engine);
>>>
>>> Must freeze the engine and irqs first, before calling
>>> i915_gem_reset_engine() (i.e. something like disable_engines_irq,
>>> cancelling tasklet)
>>>
>> Will do.
>>
>>> Eeek note that the current i915_gem_reset_engine() is lacking a
>>> spinlock.
>>>
>>
>> The new mutex (for i915_reset_engine) should cover this.
>
> No. We need to protect these lists from concurrent manipulation in the
> fences. The spinlock protection there needs to be extended here, being
> covered by struct_mutex is no longer sufficient protection for the
> current code.

I think you already addressed the last point in "drm/i915: Prevent 
timeline updates whilst performing reset" 
(00c25e3f40083a6d5f1111955baccd287ee49258), right?

But looking at these (and the interaction with possible waiters), do you 
think it is still achievable to reset a engine without grabbing the 
struct_mutex? Then both i915_reset (chip) and i915_reset_engine can use 
the "reset_in_progress".

Thanks