[Intel-gfx] [PATCH 4/4] drm/i915/gem: Always test execution status on closing the context

Fri Sep 25 13:23:32 UTC 2020

On 25/09/2020 11:05, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-09-24 15:26:56)
>>
>> On 16/09/2020 10:42, Chris Wilson wrote:
>>> Verify that if a context is active at the time it is closed, that it is
>>> either persistent and preemptible (with hangcheck running) or it shall
>>> be removed from execution.
>>>
>>> Fixes: 9a40bddd47ca ("drm/i915/gt: Expose heartbeat interval via sysfs")
>>> Testcase: igt/gem_ctx_persistence/heartbeat-close
>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>>> Cc: <stable at vger.kernel.org> # v5.7+
>>> ---
>>>    drivers/gpu/drm/i915/gem/i915_gem_context.c | 48 +++++----------------
>>>    1 file changed, 10 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> index a548626fa8bc..4fd38101bb56 100644
>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
>>> @@ -390,24 +390,6 @@ __context_engines_static(const struct i915_gem_context *ctx)
>>>        return rcu_dereference_protected(ctx->engines, true);
>>>    }
>>>    
>>> -static bool __reset_engine(struct intel_engine_cs *engine)
>>> -{
>>> -     struct intel_gt *gt = engine->gt;
>>> -     bool success = false;
>>> -
>>> -     if (!intel_has_reset_engine(gt))
>>> -             return false;
>>> -
>>> -     if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
>>> -                           &gt->reset.flags)) {
>>> -             success = intel_engine_reset(engine, NULL) == 0;
>>> -             clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
>>> -                                   &gt->reset.flags);
>>> -     }
>>> -
>>> -     return success;
>>> -}
>>> -
>>>    static void __reset_context(struct i915_gem_context *ctx,
>>>                            struct intel_engine_cs *engine)
>>>    {
>>> @@ -431,12 +413,7 @@ static bool __cancel_engine(struct intel_engine_cs *engine)
>>>         * kill the banned context, we fallback to doing a local reset
>>>         * instead.
>>>         */
>>> -     if (IS_ACTIVE(CONFIG_DRM_I915_PREEMPT_TIMEOUT) &&
>>> -         !intel_engine_pulse(engine))
>>> -             return true;
>>> -
>>> -     /* If we are unable to send a pulse, try resetting this engine. */
>>> -     return __reset_engine(engine);
>>> +     return intel_engine_pulse(engine) == 0;
>>
>> Where is the path now which actually resets the currently running
>> workload (engine) of a non-persistent context? Pulse will be sent and
>> then rely on hangcheck but otherwise let it run?
> 
> If the pulse fails, we just call intel_handle_error() on the engine. On
> looking at this code again, I could not justify the open-coding of the
> engine reset fragment of the general error handler, especially as we end
> up taking that route anyway for device resets. (Unlike inside the
> tasklet, there's no atomicity concerns on this engine-reset path.)

I think yesterday I got lost in boolean logic and flows here. Today it 
looks fine. Bool ban will be true and code will indeed enter the 
__kill_context path.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>

Regards,

Tvrtko