[Intel-gfx] [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Thu Dec 1 10:21:02 UTC 2022
On 30/11/2022 21:04, John Harrison wrote:
> On 11/30/2022 00:30, Tvrtko Ursulin wrote:
>> On 29/11/2022 21:12, John.C.Harrison at Intel.com wrote:
>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>
>>> Engine resets are supposed to never happen. But in the case when one
>>
>> Engine resets or engine reset failures? Hopefully the latter.
>>
> Oops. Yes, that was meant to say "engine resets are never supposed to
> fail."
>
>>> does (due to unknwon reasons that normally come down to a missing
> unknwon -> unknown
>
>>> w/a), it is useful to get as much information out of the system as
>>> possible. Given that the GuC effectively dies on such a situation, it
>>> is not possible to get a guilty context notification back. So do a
>>> manual search instead. Given that GuC is dead, this is safe because
>>> GuC won't be changing the engine state asynchronously.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>> ---
>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>>> 1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 0a42f1807f52c..c82730804a1c4 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct
>>> work_struct *w)
>>> guc->submission_state.reset_fail_mask = 0;
>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>> - if (likely(reset_fail_mask))
>>> + if (likely(reset_fail_mask)) {
>>> + struct intel_engine_cs *engine;
>>> + enum intel_engine_id id;
>>> +
>>> + /*
>>> + * GuC is toast at this point - it dead loops after sending
>>> the failed
>>> + * reset notification. So need to manually determine the
>>> guilty context.
>>> + * Note that it should be safe/reliable to do this here
>>> because the GuC
>>> + * is toast and will not be scheduling behind the KMD's back.
>>> + */
>>> + for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>> + intel_guc_find_hung_context(engine);
>>> +
>>> intel_gt_handle_error(gt, reset_fail_mask,
>>> I915_ERROR_CAPTURE,
>>> "GuC failed to reset engine mask=0x%x\n",
>>> reset_fail_mask);
>>
>> If GuC is defined by ABI contract to be dead, should the flow be
>> attempting to do a full GPU reset here, or maybe it happens somewhere
>> else as a consequence anyway? (In which case is the engine reset here
>> even needed?)
> This is a full GT reset. i915 is not allowed to perform an engine reset
> when using GuC submission. Those can only be done by GuC. So any forced
> reset by i915 will be escalated to full GT internally.
Okay, I saw passing in of the engine mask and drew the wrong conclusion.
Regards,
Tvrtko
More information about the dri-devel
mailing list