[Intel-gfx] [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery

Fri Jun 2 20:38:56 UTC 2017

On 6/2/2017 1:16 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2017-05-22 18:46:24)
>> +       /* try engine reset first, and continue if fails */
> 
> /* Please use sentences when convenient. It looks much neater that way. */
> 

_less_ broken English:

/*
  * Try engine reset when available. We fall back to full reset if
  * single reset fails.
  */

>> +       if (intel_has_reset_engine(dev_priv)) {
>> +               struct intel_engine_cs *engine;
>> +               unsigned int tmp;
>> +
>> +               /* protect against concurrent reset attempts */
>> +               if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS,
>> +                                    &dev_priv->gpu_error.flags))
>> +                       goto out;
>> +
>> +               for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
>> +                       if (i915_reset_engine(engine) == 0)
>> +                               engine_mask &= ~intel_engine_flag(engine);
>> +               }
>> +
>> +               /* clear unconditionally, full reset won't care about it */
>> +               clear_bit(I915_RESET_ENGINE_IN_PROGRESS,
>> +                         &dev_priv->gpu_error.flags);
>> +
>> +               if (!engine_mask)
>> +                       goto out;
>> +       }
>> +
>> +       /* full reset needs the mutex, stop any other user trying to do so */
>>          if (test_and_set_bit(I915_RESET_BACKOFF,
>>                               &dev_priv->gpu_error.flags))
> 
> We have a big gaping hole here in coordination between a global reset
> and per-engine resets.
> 
> I think you want to do something like:
> 
> if (intel_has_engine_reset()) {
> 	for_each_engine_mask() {
> 		if (test_and_set_bit(I915_RESET_ENGINE + engine->id))
> 			continue;
> 
> 		if (i915_reset_engine() == 0)
> 			engine_mask &= ~engine->mask;
> 
> 		clear_bit(I915_RESET_ENGINE + engine->id);
> 		wake_up_bit(I915_RESET_ENGINE + engine->id);
> 	}
> }
> 
> if (!engine_mask)
> 	goto out;
> 
> if (test_and_set_bit(I915_RESET_BACKOFF))
> 	goto out;
> 
> for_each_engine() {
> 	wait_on_bit(I915_RESET_ENGINE + engine->id);
> 	set_bit(I915_RESET_ENGINE);
> }
> 
> ...do global reset...
> 
> for_each_engine() {
> 	clear_bit(I915_RESET_ENGINE);
> }
> 
> The idea is that any per-engine reset that comes in whilst global is in
> progress can skip, and that the global waits for any per-engine reset to
> finish before clobbering state. The window in which global reset
> completes and a new hang is detected before we clear the bits, I am
> judiciously ignoring. Also this should allow us to perform parallel
> resets between engines. (Now that's a selling point!)

Fair enough, I can change it to support parallel resets.

One thing I forgot to ask, what should I do with the error/reset 
uevents? As it is, we will only tell userspace in case of global reset.

Thanks