[Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Fri Jan 13 09:10:22 UTC 2023


On 12/01/2023 20:46, John Harrison wrote:
> On 1/12/2023 02:06, Tvrtko Ursulin wrote:
>> On 12/01/2023 02:53, John.C.Harrison at Intel.com wrote:
>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>
>>> A hang situation has been observed where the only requests on the
>>> context were either completed or not yet started according to the
>>> breaadcrumbs. However, the register state claimed a batch was (maybe)
>>> in progress. So, allow capture of the pending request on the grounds
>>> that this might be better than nothing.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index bd2cf7d235df0..2e338a9667a4b 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>>>       if (ce) {
>>>           intel_engine_clear_hung_context(engine);
>>>           rq = intel_context_find_active_request(ce);
>>> -        if (rq && !i915_request_started(rq)) {
>>> -            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>>> with no active request!\n",
>>> -                 engine->name);
>>> -            rq = NULL;
>>> -        }
>>> +        if (rq && !i915_request_started(rq))
>>> +            drm_info(&engine->gt->i915->drm, "Confused - active 
>>> request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
>>> +                 rq->fence.context, rq->fence.seqno, ce->guc_id.id, 
>>> engine->name);
>>
>> Ah you change active to started in this patch! :)
> Yeah, I'm wanting to keep these two patches separate. This one is a more 
> questionable change in actual behaviour. The previous patch just allows 
> capturing the context when the request has been rejected. Whereas this 
> one changes the request acceptance criteria. With the potential to start 
> blaming innocent requests. It seems plausible to me, especially with the 
> warning message. We know the context owning the request is guilty so why 
> wouldn't we blame that request just because the tracking is off (maybe 
> due to some driver bug). But I could see someone objecting on grounds of 
> being super strict about who/what gets blamed for a hang and either 
> nacks or maybe wants this change reverted some time later.
> 
>>
>> I suggest no "ce" in user visible messages and maybe stick with the 
>> convention grep suggest is already established:
>>
>> "Hung context with active request %lld:%lld [0x%04X] not started!"
>>
> Are you also meaning to drop the engine name? I think it is important to 
> keep the '%s' in there somewhere.

No sorry, just an oversight.

"Hung context on %s with active request %lld:%lld [0x%04X] not started!"

Doesn't have to be exactly that, only trying to illustrate what style 
looks better to me when user facing - not mentioning confusing and fewer 
special characters.

Regards,

Tvrtko

> 
> John.
> 
> 
>> Regards,
>>
>> Tvrtko
>>
>>>       } else {
>>>           /*
>>>            * Getting here with GuC enabled means it is a forced error 
>>> capture
> 


More information about the Intel-gfx mailing list