[Intel-gfx] [PATCH] drm/i915/guc: Log engine resets

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Mon Dec 20 15:00:53 UTC 2021


On 17/12/2021 16:22, Matthew Brost wrote:
> On Fri, Dec 17, 2021 at 12:15:53PM +0000, Tvrtko Ursulin wrote:
>>
>> On 14/12/2021 15:07, Tvrtko Ursulin wrote:
>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>
>>> Log engine resets done by the GuC firmware in a similar way to how it is
>>> done by the execlists backend.
>>>
>>> This way we have a notion of where the hangs are before the GuC gains
>>> support for proper error capture.
>>
>> Ping - any interest in logging this info?
>>
>> All we currently get is a non-descriptive "[drm] GPU HANG: ecode
>> 12:0:00000000".
>>
> 
> Yea, this could be helpful. One suggestion below.
> 
>> Also, will GuC be reporting the reason for the engine reset at any point?
>>
> 
> We are working on the error state capture; presumably the registers will
> give a clue as to what caused the hang.
> 
> As for the GuC providing a reason, that isn't defined in the interface,
> but it is a decent idea to provide a hint in the G2H about what the issue
> was. Let me run that by the i915 GuC developers / GuC firmware team and
> see what they think.
> 
>> Regards,
>>
>> Tvrtko
>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> Cc: John Harrison <John.C.Harrison at Intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 +++++++++++-
>>>    1 file changed, 11 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 97311119da6f..51512123dc1a 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -11,6 +11,7 @@
>>>    #include "gt/intel_context.h"
>>>    #include "gt/intel_engine_pm.h"
>>>    #include "gt/intel_engine_heartbeat.h"
>>> +#include "gt/intel_engine_user.h"
>>>    #include "gt/intel_gpu_commands.h"
>>>    #include "gt/intel_gt.h"
>>>    #include "gt/intel_gt_clock_utils.h"
>>> @@ -3934,9 +3935,18 @@ static void capture_error_state(struct intel_guc *guc,
>>>    {
>>>    	struct intel_gt *gt = guc_to_gt(guc);
>>>    	struct drm_i915_private *i915 = gt->i915;
>>> -	struct intel_engine_cs *engine = __context_to_physical_engine(ce);
>>> +	struct intel_engine_cs *engine = ce->engine;
>>>    	intel_wakeref_t wakeref;
>>> +	if (intel_engine_is_virtual(engine)) {
>>> +		drm_notice(&i915->drm, "%s class, engines 0x%x; GuC engine reset\n",
>>> +			   intel_engine_class_repr(engine->class),
>>> +			   engine->mask);
>>> +		engine = guc_virtual_get_sibling(engine, 0);
>>> +	} else {
>>> +		drm_notice(&i915->drm, "%s GuC engine reset\n", engine->name);
> 
> Probably include the guc_id of the context too then?

Is the guc id stable and useful on its own - who would be the user?

Regards,

Tvrtko

