[PATCH] drm/xe: Add engine name to the engine reset and cat-err log

John Harrison john.c.harrison at intel.com
Mon Apr 29 21:17:31 UTC 2024


On 4/25/2024 14:16, Nirmoy Das wrote:
> Hi Matt,
>
> On 4/25/2024 10:31 PM, Matthew Brost wrote:
>> On Thu, Apr 25, 2024 at 09:11:22PM +0200, Nirmoy Das wrote:
>>> Hi Matt,
>>>
>>> On 4/25/2024 7:46 PM, Matthew Brost wrote:
>>>> On Thu, Apr 25, 2024 at 02:18:56PM +0200, Nirmoy Das wrote:
>>>>> Add engine name to the engine reset and cat error log
>>>>> which should be useful while debugging.
>>>>>
>>>>> Signed-off-by: Nirmoy Das <nirmoy.das at intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/xe/xe_guc_submit.c | 5 +++--
>>>>>    1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c 
>>>>> b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> index c7d38469fb46..245e29d095c0 100644
>>>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>>>> @@ -1655,7 +1655,7 @@ int xe_guc_exec_queue_reset_handler(struct 
>>>>> xe_guc *guc, u32 *msg, u32 len)
>>>>>        if (unlikely(!q))
>>>>>            return -EPROTO;
>>>>> -    drm_info(&xe->drm, "Engine reset: guc_id=%d", guc_id);
>>>>> +    drm_info(&xe->drm, "Engine reset: name=%s, guc_id=%d", 
>>>>> q->hwe->name, guc_id);
>>>> I don't think q->hwe->name name is useful as it might not actually be
>>>> exec queue is running. I'd drop that, and replace with string 
>>>> indicating
Not following this. What is q->hwe->name if it is not the name of the 
exec queue which owns the given guc_id?

Note that the notification is officially about a context reset not an 
engine reset. The actual implementation mechanism might be an engine 
reset but the intent and purpose is to reset the specific context as 
identified by the guc_id field. That is, the error report is not that 
BCS37 failed and needed to be reset, but that context 43 failed and 
needed to be reset and the fact that it happened to executing on BCS37 
at the time is more coincidence than cause.

If the q name is not meaningful and just some generic string then maybe 
the better fix would be to make that name more useful?

I would also change the base string to be 'Context reset' rather than 
'Engine reset'.


>>>> the hardware engine class.
>>> I will resend with engine class instead.
>>>
>> Maybe include the logical mask of exec queue too.
>
> Will do that!
>
> Nirmoy
>
>>
>> Matt
>>
>>> Thanks,
>>>
>>> Nirmoy
>>>
>>>>>        /* FIXME: Do error capture, most likely async */
>>>>> @@ -1690,7 +1690,8 @@ int 
>>>>> xe_guc_exec_queue_memory_cat_error_handler(struct xe_guc *guc, u32 
>>>>> *msg,
>>>>>        if (unlikely(!q))
>>>>>            return -EPROTO;
>>>>> -    drm_dbg(&xe->drm, "Engine memory cat error: guc_id=%d", guc_id);
>>>>> +    drm_dbg(&xe->drm, "Engine memory cat error: name=%s, guc_id=%d",
>>>>> +        q->hwe->name, guc_id);
>>>> Same here.
>>>>
Indeed. This is also not about an engine failing and create a memory 
error. It is about a context attempting to access an invalid address.

John.

>>>> Matt
>>>>
>>>>> trace_xe_exec_queue_memory_cat_error(q);
>>>>>        /* Treat the same as engine reset */
>>>>> -- 
>>>>> 2.42.0
>>>>>



More information about the Intel-xe mailing list