[Intel-gfx] [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery

Fri May 12 20:55:11 UTC 2017

On 5/8/2017 11:31 AM, Michel Thierry wrote:
> On 4/29/2017 7:19 AM, Chris Wilson wrote:
>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>>
...
>>> +    }
>>> +
>>>      intel_prepare_reset(dev_priv);
>>>
>>>      set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
>>> drm_i915_private *dev_priv)
>>>          kobject_uevent_env(kobj,
>>>                     KOBJ_CHANGE, reset_done_event);
>>>
>>> +finish:
>>>      /*
>>>       * Note: The wake_up also serves as a memory barrier so that
>>>       * waiters see the updated value of the dev_priv->gpu_error.
>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
>>> *dev_priv,
>>>                   &dev_priv->gpu_error.flags))
>>>          goto out;
>>>
>>> -    i915_reset_and_wakeup(dev_priv);
>>> +    i915_reset_and_wakeup(dev_priv, engine_mask);
>>
>> ? You don't need to wakeup the struct_mutex so we don't need this after
>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
>> certainly shouldn't be calling intel_finish_reset() without first calling
>> intel_prepare_reset(). Which is right here in my tree...
>>
>
> Looking at your tree, it wouldn't call finish_reset there either, only
> these two are called after a successful reset:
>
> finish:
>     clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
>     wake_up_all(&dev_priv->gpu_error.reset_queue);
>
> But you're right, we only need to clear the error flag, no need to call
> wake_up_all.
>
> Should I move the per-engine reset to i915_handle_error, and then leave
> i915_reset_and_wakeup just for full resets?
> That would also make the promotion from per-engine to global look a bit
> 'clearer'.
>

I just noticed an issue if I don't call wake_up_all. There can be 
someone else waiting for the reset to complete 
(i915_mutex_lock_interruptible -> i915_gem_wait_for_error).

I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the 
struct mutex (which we won't need in reset-engine) and prevent two 
concurrent reset attempts (which we still want). Time to add a new flag 
for the later? (I915_RESET_ENGINE_IN_PROGRESS?)

Here's an example without calling wake_up_all (10s timeout):
[  126.816054] [drm:i915_reset_engine [i915]] resetting rcs0
...
[  137.499910] [IGT] gem_ringfill: exiting, ret=0

Compared to the one that does,
[   69.799519] [drm:i915_reset_engine [i915]] resetting rcs0
...
[   69.801335] [IGT] gem_tdr: exiting, ret=0

Thanks,

-Michel