[Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request
John Harrison
john.c.harrison at intel.com
Thu Jan 12 20:46:27 UTC 2023
On 1/12/2023 02:06, Tvrtko Ursulin wrote:
> On 12/01/2023 02:53, John.C.Harrison at Intel.com wrote:
>> From: John Harrison <John.C.Harrison at Intel.com>
>>
>> A hang situation has been observed where the only requests on the
>> context were either completed or not yet started according to the
>> breaadcrumbs. However, the register state claimed a batch was (maybe)
>> in progress. So, allow capture of the pending request on the grounds
>> that this might be better than nothing.
>>
>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>> 1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index bd2cf7d235df0..2e338a9667a4b 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>> if (ce) {
>> intel_engine_clear_hung_context(engine);
>> rq = intel_context_find_active_request(ce);
>> - if (rq && !i915_request_started(rq)) {
>> - drm_info(&engine->gt->i915->drm, "Got hung context on %s
>> with no active request!\n",
>> - engine->name);
>> - rq = NULL;
>> - }
>> + if (rq && !i915_request_started(rq))
>> + drm_info(&engine->gt->i915->drm, "Confused - active
>> request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
>> + rq->fence.context, rq->fence.seqno, ce->guc_id.id,
>> engine->name);
>
> Ah you change active to started in this patch! :)
Yeah, I'm wanting to keep these two patches separate. This one is a more
questionable change in actual behaviour. The previous patch just allows
capturing the context when the request has been rejected. Whereas this
one changes the request acceptance criteria. With the potential to start
blaming innocent requests. It seems plausible to me, especially with the
warning message. We know the context owning the request is guilty so why
wouldn't we blame that request just because the tracking is off (maybe
due to some driver bug). But I could see someone objecting on grounds of
being super strict about who/what gets blamed for a hang and either
nacks or maybe wants this change reverted some time later.
>
> I suggest no "ce" in user visible messages and maybe stick with the
> convention grep suggest is already established:
>
> "Hung context with active request %lld:%lld [0x%04X] not started!"
>
Are you also meaning to drop the engine name? I think it is important to
keep the '%s' in there somewhere.
John.
> Regards,
>
> Tvrtko
>
>> } else {
>> /*
>> * Getting here with GuC enabled means it is a forced error
>> capture
More information about the Intel-gfx
mailing list