[Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Jan 13 09:10:22 UTC 2023
On 12/01/2023 20:46, John Harrison wrote:
> On 1/12/2023 02:06, Tvrtko Ursulin wrote:
>> On 12/01/2023 02:53, John.C.Harrison at Intel.com wrote:
>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>
>>> A hang situation has been observed where the only requests on the
>>> context were either completed or not yet started according to the
>>> breaadcrumbs. However, the register state claimed a batch was (maybe)
>>> in progress. So, allow capture of the pending request on the grounds
>>> that this might be better than nothing.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>> ---
>>> drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>>> 1 file changed, 3 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index bd2cf7d235df0..2e338a9667a4b 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>>> if (ce) {
>>> intel_engine_clear_hung_context(engine);
>>> rq = intel_context_find_active_request(ce);
>>> - if (rq && !i915_request_started(rq)) {
>>> - drm_info(&engine->gt->i915->drm, "Got hung context on %s
>>> with no active request!\n",
>>> - engine->name);
>>> - rq = NULL;
>>> - }
>>> + if (rq && !i915_request_started(rq))
>>> + drm_info(&engine->gt->i915->drm, "Confused - active
>>> request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
>>> + rq->fence.context, rq->fence.seqno, ce->guc_id.id,
>>> engine->name);
>>
>> Ah you change active to started in this patch! :)
> Yeah, I'm wanting to keep these two patches separate. This one is a more
> questionable change in actual behaviour. The previous patch just allows
> capturing the context when the request has been rejected. Whereas this
> one changes the request acceptance criteria. With the potential to start
> blaming innocent requests. It seems plausible to me, especially with the
> warning message. We know the context owning the request is guilty so why
> wouldn't we blame that request just because the tracking is off (maybe
> due to some driver bug). But I could see someone objecting on grounds of
> being super strict about who/what gets blamed for a hang and either
> nacks or maybe wants this change reverted some time later.
>
>>
>> I suggest no "ce" in user visible messages and maybe stick with the
>> convention grep suggest is already established:
>>
>> "Hung context with active request %lld:%lld [0x%04X] not started!"
>>
> Are you also meaning to drop the engine name? I think it is important to
> keep the '%s' in there somewhere.
No sorry, just an oversight.
"Hung context on %s with active request %lld:%lld [0x%04X] not started!"
Doesn't have to be exactly that, only trying to illustrate what style
looks better to me when user facing - not mentioning confusing and fewer
special characters.
Regards,
Tvrtko
>
> John.
>
>
>> Regards,
>>
>> Tvrtko
>>
>>> } else {
>>> /*
>>> * Getting here with GuC enabled means it is a forced error
>>> capture
>
More information about the Intel-gfx
mailing list