[Intel-gfx] [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery
Michel Thierry
michel.thierry at intel.com
Fri May 12 20:55:11 UTC 2017
On 5/8/2017 11:31 AM, Michel Thierry wrote:
> On 4/29/2017 7:19 AM, Chris Wilson wrote:
>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>>> From: Arun Siluvery <arun.siluvery at linux.intel.com>
>>>
...
>>> + }
>>> +
>>> intel_prepare_reset(dev_priv);
>>>
>>> set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
>>> drm_i915_private *dev_priv)
>>> kobject_uevent_env(kobj,
>>> KOBJ_CHANGE, reset_done_event);
>>>
>>> +finish:
>>> /*
>>> * Note: The wake_up also serves as a memory barrier so that
>>> * waiters see the updated value of the dev_priv->gpu_error.
>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
>>> *dev_priv,
>>> &dev_priv->gpu_error.flags))
>>> goto out;
>>>
>>> - i915_reset_and_wakeup(dev_priv);
>>> + i915_reset_and_wakeup(dev_priv, engine_mask);
>>
>> ? You don't need to wakeup the struct_mutex so we don't need this after
>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
>> certainly shouldn't be calling intel_finish_reset() without first calling
>> intel_prepare_reset(). Which is right here in my tree...
>>
>
> Looking at your tree, it wouldn't call finish_reset there either, only
> these two are called after a successful reset:
>
> finish:
> clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
> wake_up_all(&dev_priv->gpu_error.reset_queue);
>
> But you're right, we only need to clear the error flag, no need to call
> wake_up_all.
>
> Should I move the per-engine reset to i915_handle_error, and then leave
> i915_reset_and_wakeup just for full resets?
> That would also make the promotion from per-engine to global look a bit
> 'clearer'.
>
I just noticed an issue if I don't call wake_up_all. There can be
someone else waiting for the reset to complete
(i915_mutex_lock_interruptible -> i915_gem_wait_for_error).
I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the
struct mutex (which we won't need in reset-engine) and prevent two
concurrent reset attempts (which we still want). Time to add a new flag
for the later? (I915_RESET_ENGINE_IN_PROGRESS?)
Here's an example without calling wake_up_all (10s timeout):
[ 126.816054] [drm:i915_reset_engine [i915]] resetting rcs0
...
[ 137.499910] [IGT] gem_ringfill: exiting, ret=0
Compared to the one that does,
[ 69.799519] [drm:i915_reset_engine [i915]] resetting rcs0
...
[ 69.801335] [IGT] gem_tdr: exiting, ret=0
Thanks,
-Michel
More information about the Intel-gfx
mailing list