[Intel-gfx] [RFC 7/7] drm/i915/guc: Print the GuC error capture output register list.
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Fri Jan 7 09:03:26 UTC 2022
On 06/01/2022 18:33, Teres Alexis, Alan Previn wrote:
>
> On Thu, 2022-01-06 at 09:38 +0000, Tvrtko Ursulin wrote:
>> On 05/01/2022 17:30, Teres Alexis, Alan Previn wrote:
>>> On Tue, 2022-01-04 at 13:56 +0000, Tvrtko Ursulin wrote:
>>>>> The flow of events are as below:
>>>>>
>>>>> 1. GuC sends a notification that an error capture was done and is ready to be retrieved.
>>>>>     - at this point we copy the GuC error-capture dump into an interim store
>>>>>       (a larger buffer that can hold multiple captures).
>>>>> 2. GuC sends a notification that a context was reset (after the prior notification).
>>>>>     - this triggers a call to i915_gpu_coredump with the engine mask corresponding
>>>>>       to the context that was reset.
>>>>>     - i915_gpu_coredump proceeds to gather the entire GPU state, including driver state,
>>>>>       global GPU state, engine state, context VMAs and also engine registers. For the
>>>>>       engine registers we now call into the guc_capture code, which merely needs to verify
>>>>>       that GuC has already done step 1 and that we have data ready to be parsed.
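The two-step flow quoted above (stash the capture on the first G2H notification, then match it up on the later context-reset notification) can be sketched as a toy model. This is illustrative only, not actual i915 code; names such as `interim_store`, `capture_notify` and `context_reset_find` are hypothetical stand-ins:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CAPTURES 8

/* Stand-in for one stashed GuC error-capture dump. */
struct capture_snapshot {
	int guc_id;		/* context the dump belongs to */
	char regs[64];		/* stand-in for the register dump */
	int valid;
};

/* Interim store: larger buffer that can hold multiple captures. */
static struct capture_snapshot interim_store[MAX_CAPTURES];

/* Step 1: error-capture notification - copy the dump into the store. */
static void capture_notify(int guc_id, const char *dump)
{
	for (int i = 0; i < MAX_CAPTURES; i++) {
		if (!interim_store[i].valid) {
			interim_store[i].guc_id = guc_id;
			strncpy(interim_store[i].regs, dump,
				sizeof(interim_store[i].regs) - 1);
			interim_store[i].valid = 1;
			return;
		}
	}
}

/* Step 2: context-reset notification - verify step 1 already ran
 * for this context and return the data ready to be parsed. */
static struct capture_snapshot *context_reset_find(int guc_id)
{
	for (int i = 0; i < MAX_CAPTURES; i++)
		if (interim_store[i].valid && interim_store[i].guc_id == guc_id)
			return &interim_store[i];
	return NULL;	/* no capture stashed for this context */
}
```

The point of the two stages is that the coredump path never parses GuC data directly out of the notification; it only consumes what step 1 already copied aside.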
>>>>
>>>> What about the time between the actual reset and receiving the context
>>>> reset notification? Latter will contain intel_context->guc_id - can that
>>>> be re-assigned or "retired" in between the two and so cause problems for
>>>> matching the correct (or any) vmas?
>>>>
>>> No, it cannot, because it is only after the context-reset notification that i915 starts
>>> taking action against that context - and even that happens after i915_gpu_coredump
>>> (with the engine mask of the context) runs. That's what I've observed in the code flow.
>>
>> The fact it is "only after" is exactly why I asked.
>>
>> Reset notification is in a CT queue with other stuff, right? So it can arrive
>> some unrelated time after the actual reset. Could the context have been retired
>> in the meantime and the guc_id released - that is the question.
>>
>> Because i915 has no idea there was a reset until this delayed message
>> comes over, but it could see user interrupt signaling end of batch,
>> after the reset has happened, unbeknown to i915, right?
>>
>> Perhaps the answer is guc_id cannot be released via the request retire
>> flows. Or GuC signaling release of guc_id is a thing, which is then
>> ordered via the same CT buffer.
>>
>> I don't know, just asking.
>>
> As long as the context is pinned, the guc-id won't be re-assigned. After a bit of offline brain-dump
> from John Harrison, there are many factors that can keep the context pinned (refcounts), including
> new or outstanding requests. So a guc-id can't get re-assigned between a capture-notify and a
> context-reset even if that outstanding request is the only refcount left, since it would still
> be considered outstanding by the driver. I also think we may be talking past each other
> in the sense that the guc-id is something the driver assigns to a context being pinned, and only
> the driver can un-assign it (both assigning and un-assigning happen via H2G interactions).
> I get the sense you are assuming the GuC can un-assign guc-ids on its own - which isn't
> the case. Apologies if I mis-assumed.
I did not think GuC can re-assign ce->guc_id. I asked about a request/context complete/retire happening before the reset/capture notification is received.

That would be the time window after the last intel_context_put - i.e. the last i915_request_put from retire - at which point, AFAICT, the GuC code releases the guc_id. Execution timeline like:
  |------ rq1 ------|------ rq2 ------|
           ^ engine reset
                                      ^ rq2, rq1 retire, guc_id released
                                         ^ GuC reset notify received - guc_id not known any more?
You are saying something is guaranteed to be holding onto the guc_id at the point of receiving the notification? "There are many factors that can keep the context pinned" - which one is it in this case? Or can the case not happen at all?
Regards,
Tvrtko