[Intel-gfx] [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu Dec 1 10:21:02 UTC 2022


On 30/11/2022 21:04, John Harrison wrote:
> On 11/30/2022 00:30, Tvrtko Ursulin wrote:
>> On 29/11/2022 21:12, John.C.Harrison at Intel.com wrote:
>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>
>>> Engine resets are supposed to never happen. But in the case when one
>>
>> Engine resets or engine reset failures? Hopefully the latter.
>>
> Oops. Yes, that was meant to say "engine resets are never supposed to 
> fail."
> 
>>> does (due to unknwon reasons that normally come down to a missing
> unknwon -> unknown
> 
>>> w/a), it is useful to get as much information out of the system as
>>> possible. Given that the GuC effectively dies on such a situation, it
>>> is not possible to get a guilty context notification back. So do a
>>> manual search instead. Given that GuC is dead, this is safe because
>>> GuC won't be changing the engine state asynchronously.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 0a42f1807f52c..c82730804a1c4 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>>>       guc->submission_state.reset_fail_mask = 0;
>>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>> -    if (likely(reset_fail_mask))
>>> +    if (likely(reset_fail_mask)) {
>>> +        struct intel_engine_cs *engine;
>>> +        enum intel_engine_id id;
>>> +
>>> +        /*
>>> +         * GuC is toast at this point - it dead loops after sending the failed
>>> +         * reset notification. So need to manually determine the guilty context.
>>> +         * Note that it should be safe/reliable to do this here because the GuC
>>> +         * is toast and will not be scheduling behind the KMD's back.
>>> +         */
>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>> +            intel_guc_find_hung_context(engine);
>>> +
>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>                         I915_ERROR_CAPTURE,
>>>                         "GuC failed to reset engine mask=0x%x\n",
>>>                         reset_fail_mask);
>>
>> If GuC is defined by ABI contract to be dead, should the flow be 
>> attempting to do a full GPU reset here, or maybe it happens somewhere 
>> else as a consequence anyway? (In which case is the engine reset here 
>> even needed?)
> This is a full GT reset. i915 is not allowed to perform an engine reset 
> when using GuC submission. Those can only be done by GuC. So any forced 
> reset by i915 will be escalated to full GT internally.

Okay, I saw passing in of the engine mask and drew the wrong conclusion.
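
For anyone else who trips over the same thing, the flow described above can be
sketched as a small userspace model. This is only an illustration under stated
assumptions, not the i915 code: every toy_* name, NUM_ENGINES and ALL_ENGINES
are hypothetical stand-ins, and the real escalation happens inside
intel_gt_handle_error() in ways this sketch does not try to reproduce. It only
shows the two points from the thread: the guilty-context search runs per engine
in the failed mask, and with GuC submission the engine mask passed in does not
limit the reset, which becomes a full GT reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the i915 definitions. */
#define NUM_ENGINES 8u
#define ALL_ENGINES ((1u << NUM_ENGINES) - 1u)

struct toy_gt {
	bool uses_guc_submission;
	uint32_t hung_search_mask;	/* engines searched for a guilty context */
	uint32_t last_reset_mask;	/* engines covered by the last reset */
};

/* Record that a manual guilty-context search ran on this engine. */
static void toy_find_hung_context(struct toy_gt *gt, unsigned int id)
{
	assert(id < NUM_ENGINES);
	gt->hung_search_mask |= 1u << id;
}

/*
 * Assumption taken from the thread: with GuC submission the host may not
 * reset individual engines, so any host-forced reset covers the whole GT.
 */
static uint32_t toy_handle_error(struct toy_gt *gt, uint32_t engine_mask)
{
	uint32_t reset_mask = gt->uses_guc_submission ? ALL_ENGINES : engine_mask;

	gt->last_reset_mask = reset_mask;
	return reset_mask;
}

/* Model of the worker: search each failed engine, then request the reset. */
static uint32_t toy_reset_fail_worker(struct toy_gt *gt, uint32_t reset_fail_mask)
{
	unsigned int id;

	if (!reset_fail_mask)
		return 0;

	for (id = 0; id < NUM_ENGINES; id++)
		if (reset_fail_mask & (1u << id))
			toy_find_hung_context(gt, id);

	return toy_handle_error(gt, reset_fail_mask);
}
```

In this model the mask only selects which engines get searched for a guilty
context; it has no say in the scope of the reset once GuC submission is in use.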

Regards,

Tvrtko

