[Intel-gfx] [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery

Fri May 12 21:23:57 UTC 2017

On 5/12/2017 2:09 PM, Chris Wilson wrote:
> On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote:
>> On 5/8/2017 11:31 AM, Michel Thierry wrote:
>>> On 4/29/2017 7:19 AM, Chris Wilson wrote:
>>>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>>>>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>>>>
>> ...
>>>>> +    }
>>>>> +
>>>>>     intel_prepare_reset(dev_priv);
>>>>>
>>>>>     set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
>>>>> drm_i915_private *dev_priv)
>>>>>         kobject_uevent_env(kobj,
>>>>>                    KOBJ_CHANGE, reset_done_event);
>>>>>
>>>>> +finish:
>>>>>     /*
>>>>>      * Note: The wake_up also serves as a memory barrier so that
>>>>>      * waiters see the updated value of the dev_priv->gpu_error.
>>>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
>>>>> *dev_priv,
>>>>>                  &dev_priv->gpu_error.flags))
>>>>>         goto out;
>>>>>
>>>>> -    i915_reset_and_wakeup(dev_priv);
>>>>> +    i915_reset_and_wakeup(dev_priv, engine_mask);
>>>>
>>>> ? You don't need to wakeup the struct_mutex so we don't need this after
>>>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
>>>> certainly shouldn't be calling intel_finish_reset() without first calling
>>>> intel_prepare_reset(). Which is right here in my tree...
>>>>
>>>
>>> Looking at your tree, it wouldn't call finish_reset there either, only
>>> these two are called after a successful reset:
>>>
>>> finish:
>>>    clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
>>>    wake_up_all(&dev_priv->gpu_error.reset_queue);
>>>
>>> But you're right, we only need to clear the error flag, no need to call
>>> wake_up_all.
>>>
>>> Should I move the per-engine reset to i915_handle_error, and then leave
>>> i915_reset_and_wakeup just for full resets?
>>> That would also make the promotion from per-engine to global look a bit
>>> 'clearer'.
>>>
>>
>> I just noticed an issue if I don't call wake_up_all. There can be
>> someone else waiting for the reset to complete
>> (i915_mutex_lock_interruptible -> i915_gem_wait_for_error).
>>
>> I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the
>> struct mutex (which we won't need in reset-engine) and prevent two
>> concurrent reset attempts (which we still want). Time to add a new
>> flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?)
>
> Yes, that would be a good idea to avoid dual purposing the bits. Now
> that we do direct resets along the wait path, we can completely drop the
> i915_mutex_interruptible(). (No one else should be holding the mutex
> indefinitely.) I think that's a better approach -- I think we've already
> moved all the EIO magic aware to the ABI points where we deemed it
> necessary.

And it seems to work ok with the new flag and no wake_up. I'll run more 
tests.

Thanks