[Intel-gfx] [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery

Tue Apr 18 22:01:02 UTC 2017

On 18/04/17 14:40, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:19PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>
>> This is a preparatory patch which modifies error handler to do per engine
>> hang recovery. The actual patch which implements this sequence follows
>> later in the series. The aim is to prepare existing recovery function to
>> adapt to this new function where applicable (which fails at this point
>> because core implementation is lacking) and continue recovery using legacy
>> full gpu reset.
>>
>> A helper function is also added to query the availability of engine
>> reset.
>>
>> The error events behaviour that are used to notify user of reset are
>> adapted to engine reset such that it doesn't break users listening to these
>> events. In legacy we report an error event, a reset event before resetting
>> the gpu and a reset done event marking the completion of reset. The same
>> behaviour is adapted but reset event is only dispatched once even when
>> multiple engines are hung. Finally once reset is complete we send reset
>> done event as usual.
>>
>> Note that this implementation of engine reset is for i915 directly
>> submitting to the ELSP, where the driver manages the hang detection,
>> recovery and resubmission. With GuC submission these tasks are shared
>> between driver and firmware; i915 will still responsible for detecting a
>> hang, and when it does it will have to request GuC to reset that Engine and
>> remind the firmware about the outstanding submissions. This will be
>> added in different patch.
>>
>> v2: rebase, advertise engine reset availability in platform definition,
>> add note about GuC submission.
>> v3: s/*engine_reset*/*reset_engine*/. (Chris)
>> Handle reset as 2 level resets, by first going to engine only and fall
>> backing to full/chip reset as needed, i.e. reset_engine will need the
>> struct_mutex.
>> v4: Pass the engine mask to i915_reset. (Chris)
>> v5: Rebase, update selftests.
>> v6: Rebase, prepare for mutex-less reset engine.
>
> I'm not sure if there is any trace of the original patch left. Certainly
> this is the first that has come close to making me happy and looks like
> it might actually work. :)
>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index e06af46f5a57..7bc5f552add7 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -814,6 +814,7 @@ struct intel_csr {
>>  	func(has_ddi); \
>>  	func(has_decoupled_mmio); \
>>  	func(has_dp_mst); \
>> +	func(has_reset_engine); \
>>  	func(has_fbc); \
>>  	func(has_fpga_dbg); \
>>  	func(has_full_ppgtt); \
>> @@ -1616,6 +1617,9 @@ struct i915_gpu_error {
>>  #define I915_RESET_HANDOFF	1
>>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>>
>> +	/* if available, engine-specific reset is tried before full gpu reset */
>> +	u32 reset_engine_mask;
>
> I want to convince myself that storing this here is sensible. My
> expectation was that this would be passed along as a function parameter,
> so I'll need to go back and see why that doesn't work.
>

This is a left-over from the previous version, the engine mask can be 
passed as a parameter to i915_reset_and_wakeup.

>>  	/**
>>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>>  	 * to release the struct_mutex for the reset to procede.
>> @@ -3019,6 +3023,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>>  extern void i915_reset(struct drm_i915_private *dev_priv);
>> +extern int i915_reset_engine(struct intel_engine_cs *engine);
>> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index fd97fe00cd0d..ab1d77ab0977 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2645,12 +2645,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
>> +	u32 engine_mask = dev_priv->gpu_error.reset_engine_mask;
>>
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>>
>> -	DRM_DEBUG_DRIVER("resetting chip\n");
>> +	/*
>> +	 * This event needs to be sent before performing gpu reset. When
>> +	 * engine resets are supported we iterate through all engines and
>> +	 * reset hung engines individually. To keep the event dispatch
>> +	 * mechanism consistent with full gpu reset, this is only sent once
>> +	 * even when multiple engines are hung. It is also safe to move this
>> +	 * here because when we are in this function, we will definitely
>> +	 * perform gpu reset.
>> +	 */
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>>
>> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
>> +	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {
>
> if (has_reset_engine() && is_power_of_2(engine_mask)) {
> 	engine = dev_priv->engine[ilog2(engine_mask)];
>
> Not worth iterating over the engine_mask?
>
> for_each_bit() {
> 	if (i915_reset_engine(engine) == 0)
> 		dev_prv->gpu_error.reset_engine_mask &= ~intel_engine_flag()
> if (reset_engine_mask)
> 	DRM_WARN("per-engine reset failed, promoting to full gpu_reset\n");
>

Time-wise, I'm not so sure about the benefits of 2 reset_engines vs full 
gpu reset. But it can be done... and thanks for the ilog2 tip.

>> +		struct intel_engine_cs *engine =
>> +			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
>> +
>> +		if (i915_reset_engine(engine) == 0)
>> +			goto finish;
>> +
>> +		DRM_ERROR("reset %s failed, promoting to full gpu reset\n",
>> +			  engine->name);
>
> Justify an *ERROR*. Is it a catastrophe? Do we have a recovery
> mechanism? A DRM_WARN is appropriate as something went wrong at the
> hw/driver level, but it should not be the end of the world - unlike the
> global reset itself failing.
>

I'll change it to DRM_WARN. I haven't been able to reproduce it (only 
when I change the code on purpose, or with the live test - 
igt_render_engine_reset_fallback)

>> +	}
>> +
>>  	intel_prepare_reset(dev_priv);
>
> I am much happier now that we circumvent the big hammer global/display
> reset (and so this series now can remain blissfully ignorant of the
> display reset regression ;)
> -Chris
>