[Intel-gfx] [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery

Mon May 8 18:31:08 UTC 2017

On 4/29/2017 7:19 AM, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>
>> This is a preparatory patch which modifies error handler to do per engine
>> hang recovery. The actual patch which implements this sequence follows
>> later in the series. The aim is to prepare existing recovery function to
>> adapt to this new function where applicable (which fails at this point
>> because core implementation is lacking) and continue recovery using legacy
>> full gpu reset.
>>
>> A helper function is also added to query the availability of engine
>> reset.
>>
>> The error events behaviour that are used to notify user of reset are
>> adapted to engine reset such that it doesn't break users listening to these
>> events. In legacy we report an error event, a reset event before resetting
>> the gpu and a reset done event marking the completion of reset. The same
>> behaviour is adapted but reset event is only dispatched once even when
>> multiple engines are hung. Finally once reset is complete we send reset
>> done event as usual.
>>
>> Note that this implementation of engine reset is for i915 directly
>> submitting to the ELSP, where the driver manages the hang detection,
>> recovery and resubmission. With GuC submission these tasks are shared
>> between driver and firmware; i915 will still responsible for detecting a
>> hang, and when it does it will have to request GuC to reset that Engine and
>> remind the firmware about the outstanding submissions. This will be
>> added in different patch.
>>
>> v2: rebase, advertise engine reset availability in platform definition,
>> add note about GuC submission.
>> v3: s/*engine_reset*/*reset_engine*/. (Chris)
>> Handle reset as 2 level resets, by first going to engine only and fall
>> backing to full/chip reset as needed, i.e. reset_engine will need the
>> struct_mutex.
>> v4: Pass the engine mask to i915_reset. (Chris)
>> v5: Rebase, update selftests.
>> v6: Rebase, prepare for mutex-less reset engine.
>> v7: Pass reset_engine mask as a function parameter, and iterate over the
>> engine mask for reset_engine. (Chris)
>>
>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
>> Signed-off-by: Ian Lister <ian.lister at intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf at intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery at linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry at intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c     | 15 +++++++++++++++
>>  drivers/gpu/drm/i915/i915_drv.h     |  3 +++
>>  drivers/gpu/drm/i915/i915_irq.c     | 33 ++++++++++++++++++++++++++++++---
>>  drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
>>  drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
>>  5 files changed, 63 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index c7d68e789642..48c8b69d9bde 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
>>  		return;
>>
>> +	DRM_DEBUG_DRIVER("resetting chip\n");
>
> This is redundant since we have a "Resetting chip" already here. Just
> kill it.
>
>> +
>>  	/* Clear any previous failed attempts at recovery. Time to try again. */
>>  	if (!i915_gem_unset_wedged(dev_priv))
>>  		goto wakeup;
>> @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	goto finish;
>>  }
>>
>> +/**
>> + * i915_reset_engine - reset GPU engine to recover from a hang
>> + * @engine: engine to reset
>> + *
>> + * Reset a specific GPU engine. Useful if a hang is detected.
>> + * Returns zero on successful reset or otherwise an error code.
>> + */
>> +int i915_reset_engine(struct intel_engine_cs *engine)
>> +{
>> +	/* FIXME: replace me with engine reset sequence */
>> +	return -ENODEV;
>> +}
>> +
>>  static int i915_pm_suspend(struct device *kdev)
>>  {
>>  	struct pci_dev *pdev = to_pci_dev(kdev);
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index e06af46f5a57..ab7e68626c49 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -814,6 +814,7 @@ struct intel_csr {
>>  	func(has_ddi); \
>>  	func(has_decoupled_mmio); \
>>  	func(has_dp_mst); \
>> +	func(has_reset_engine); \
>>  	func(has_fbc); \
>>  	func(has_fpga_dbg); \
>>  	func(has_full_ppgtt); \
>> @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>>  extern void i915_reset(struct drm_i915_private *dev_priv);
>> +extern int i915_reset_engine(struct intel_engine_cs *engine);
>> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index fd97fe00cd0d..3a59ef1367ec 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>>  /**
>>   * i915_reset_and_wakeup - do process context error handling work
>>   * @dev_priv: i915 device private
>> + * @engine_mask: engine(s) hung - for reset-engine only.
>>   *
>>   * Fire an error uevent so userspace can see that a hang or error
>>   * was detected.
>>   */
>> -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>> +static void
>> +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask)
>>  {
>>  	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
>>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>> @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>>
>> -	DRM_DEBUG_DRIVER("resetting chip\n");
>> +	/*
>> +	 * This event needs to be sent before performing gpu reset. When
>> +	 * engine resets are supported we iterate through all engines and
>> +	 * reset hung engines individually. To keep the event dispatch
>> +	 * mechanism consistent with full gpu reset, this is only sent once
>> +	 * even when multiple engines are hung. It is also safe to move this
>> +	 * here because when we are in this function, we will definitely
>> +	 * perform gpu reset.
>> +	 */
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>>
>> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
>> +	if (intel_has_reset_engine(dev_priv)) {
>> +		struct intel_engine_cs *engine;
>> +		unsigned int tmp;
>> +
>> +		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
>> +			if (i915_reset_engine(engine) == 0)
>> +				engine_mask &= ~intel_engine_flag(engine);
>> +		}
>> +
>> +		if (engine_mask)
>> +			DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
>> +		else
>> +			goto finish;
>
> This will look nicer if we did just try per-engine reset and then
> quitely promote (it's not that quiet as we do get logging) to global.
>
> for_each_engine_masked() {}
> if (!engine_mask)
>
>
>> +	}
>> +
>>  	intel_prepare_reset(dev_priv);
>>
>>  	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>  		kobject_uevent_env(kobj,
>>  				   KOBJ_CHANGE, reset_done_event);
>>
>> +finish:
>>  	/*
>>  	 * Note: The wake_up also serves as a memory barrier so that
>>  	 * waiters see the updated value of the dev_priv->gpu_error.
>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>>  			     &dev_priv->gpu_error.flags))
>>  		goto out;
>>
>> -	i915_reset_and_wakeup(dev_priv);
>> +	i915_reset_and_wakeup(dev_priv, engine_mask);
>
> ? You don't need to wakeup the struct_mutex so we don't need this after
> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
> certainly shouldn't be calling intel_finish_reset() without first calling
> intel_prepare_reset(). Which is right here in my tree...
>

Looking at your tree, it wouldn't call finish_reset there either, only 
these two are called after a successful reset:

finish:
	clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
	wake_up_all(&dev_priv->gpu_error.reset_queue);

But you're right, we only need to clear the error flag, no need to call 
wake_up_all.

Should I move the per-engine reset to i915_handle_error, and then leave 
i915_reset_and_wakeup just for full resets?
That would also make the promotion from per-engine to global look a bit 
'clearer'.

Thanks,

>> +/*
>> + * When GuC submission is enabled, GuC manages ELSP and can initiate the
>> + * engine reset too. For now, fall back to full GPU reset if it is enabled.
>> + */
>> +bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
>> +{
>> +	return (dev_priv->info.has_reset_engine &&
>> +		!dev_priv->guc.execbuf_client &&
>> +		i915.reset == 2);
>
> i915.reset >= 2
> -Chris
>