[PATCH 2/2] drm/xe/guc: Support crash dump notification from GuC

John Harrison john.c.harrison at intel.com
Sat Nov 9 00:39:49 UTC 2024


On 11/8/2024 15:56, Matthew Brost wrote:
> On Fri, Nov 08, 2024 at 03:51:12PM -0800, John Harrison wrote:
>> On 11/8/2024 15:35, Matthew Brost wrote:
>>> On Fri, Nov 08, 2024 at 01:27:37PM -0800, John.C.Harrison at Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>
>>>> Add support for the two crash dump notifications from GuC. Either one
>>>> means GuC is toast, so just capture state and trigger a reset.
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>> ---
>>>>    drivers/gpu/drm/xe/xe_guc_ct.c | 23 +++++++++++++++++++++++
>>>>    1 file changed, 23 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> index 63bd91963eb1..7eb175a0b874 100644
>>>> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
>>>> @@ -54,6 +54,7 @@ enum {
>>>>    	CT_DEAD_PARSE_G2H_UNKNOWN,		/* 0x1000 */
>>>>    	CT_DEAD_PARSE_G2H_ORIGIN,		/* 0x2000 */
>>>>    	CT_DEAD_PARSE_G2H_TYPE,			/* 0x4000 */
>>>> +	CT_DEAD_CRASH,				/* 0x8000 */
>>>>    };
>>>>    static void ct_dead_worker_func(struct work_struct *w);
>>>> @@ -1120,6 +1121,24 @@ static int parse_g2h_event(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>>>    	return 0;
>>>>    }
>>>> +static int guc_crash_process_msg(struct xe_guc_ct *ct, u32 action)
>>>> +{
>>>> +	struct xe_gt *gt = ct_to_gt(ct);
>>>> +
>>>> +	if (action == XE_GUC_ACTION_NOTIFY_CRASH_DUMP_POSTED)
>>>> +		xe_gt_err(gt, "GuC Crash dump notification\n");
>>>> +	else if (action == XE_GUC_ACTION_NOTIFY_EXCEPTION)
>>>> +		xe_gt_err(gt, "GuC Exception notification\n");
>>>> +	else
>>>> +		xe_gt_err(gt, "Unknown GuC crash notification: 0x%04X\n", action);
>>>> +
>>>> +	CT_DEAD(ct, NULL, CRASH);
>>>> +
>>>> +	kick_reset(ct);
>>> Side note, we may want to wire a devcoredump to a GT reset too.
>> I have a work-in-progress series to allow creating a devcoredump without a
>> scheduler job. I assume that would be a prerequisite to creating one from an
>> arbitrary GT reset. Certainly coming in from an async event such as this,
>> there is no scheduler job to use. Hoping to post that soon. Should be easy
>> enough to connect it to the GT reset then.
>>
> We appear to be stepping on each other's feet, just posted this one...
>
> https://patchwork.freedesktop.org/series/141110/
I did see that. Haven't had a chance to look in detail yet. But I don't
think it really affects the changes I'm doing. Whether it is sched_job or
exec_queue makes no difference; we don't have access to either
outside of the submission path. My other changes are more about
splitting the print code up a bit to allow dump via the dmesg helper
(for internal developer use) as well as via sysfs. The bit I'm missing
at the moment is how to get to engine state without having a job/queue
to start from. I also want to allow capture of multiple GTs in a
single dump. I'll see if I can quickly clean up what I've got so far and
post it so you can take a look.

John.

>
> I had to code these locally while working on something else so threw
> them on the list.
>
> Let me know if I missed anything there or if you want me to hold up
> merging as I was planning on merging once CI is clean.
>
> Also agree it is a small rework (don't assume we have a queue) on top of
> this to connect this to a GT reset.
>
> Matt
>
>> John.
>>
>>> Anyways this patch LGTM. With that:
>>> Reviewed-by: Matthew Brost <matthew.brost at intel.com>
>>>
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>    static int parse_g2h_response(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>>>    {
>>>>    	struct xe_gt *gt =  ct_to_gt(ct);
>>>> @@ -1294,6 +1313,10 @@ static int process_g2h_msg(struct xe_guc_ct *ct, u32 *msg, u32 len)
>>>>    	case GUC_ACTION_GUC2PF_ADVERSE_EVENT:
>>>>    		ret = xe_gt_sriov_pf_monitor_process_guc2pf(gt, hxg, hxg_len);
>>>>    		break;
>>>> +	case XE_GUC_ACTION_NOTIFY_CRASH_DUMP_POSTED:
>>>> +	case XE_GUC_ACTION_NOTIFY_EXCEPTION:
>>>> +		ret = guc_crash_process_msg(ct, action);
>>>> +		break;
>>>>    	default:
>>>>    		xe_gt_err(gt, "unexpected G2H action 0x%04x\n", action);
>>>>    	}
>>>> -- 
>>>> 2.47.0
>>>>
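
For anyone skimming the thread, the control flow of the quoted patch can be
modelled in plain userspace C. This is only a rough sketch, not driver code:
every model_* name, the placeholder action values and the struct layout are
invented for illustration and are not the real GuC ABI or xe internals. It
also assumes, based on the hex comments in the quoted enum, that each
CT_DEAD_* enumerator is a bit index into a reason mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder action codes: hypothetical, NOT the real GuC ABI values. */
enum model_action {
	MODEL_ACTION_CRASH_DUMP_POSTED = 1,
	MODEL_ACTION_EXCEPTION = 2,
	MODEL_ACTION_OTHER = 3,
};

/*
 * Reason indices. The first one is pinned to 12 so that (1 << index)
 * lines up with the 0x1000..0x8000 comments in the quoted enum.
 */
enum model_dead_reason {
	MODEL_DEAD_PARSE_G2H_UNKNOWN = 12,	/* 0x1000 */
	MODEL_DEAD_PARSE_G2H_ORIGIN,		/* 0x2000 */
	MODEL_DEAD_PARSE_G2H_TYPE,		/* 0x4000 */
	MODEL_DEAD_CRASH,			/* 0x8000 */
};

/* Hypothetical stand-in for the CT state touched by the patch. */
struct model_ct {
	uint32_t dead_reasons;	/* bitmask of recorded dead reasons */
	bool reset_requested;	/* set once a reset has been kicked */
};

/* Record a reason as a single bit in the mask. */
static void model_ct_dead(struct model_ct *ct, enum model_dead_reason reason)
{
	assert(reason < 32);
	ct->dead_reasons |= UINT32_C(1) << reason;
}

/* Stand-in for kicking a reset: here it only sets a flag. */
static void model_kick_reset(struct model_ct *ct)
{
	ct->reset_requested = true;
}

/* Mirrors the shape of the new handler: log, mark dead, kick reset. */
static int model_crash_process_msg(struct model_ct *ct, uint32_t action)
{
	if (action == MODEL_ACTION_CRASH_DUMP_POSTED)
		fprintf(stderr, "GuC Crash dump notification\n");
	else if (action == MODEL_ACTION_EXCEPTION)
		fprintf(stderr, "GuC Exception notification\n");
	else
		fprintf(stderr, "Unknown GuC crash notification: 0x%04X\n",
			(unsigned int)action);

	model_ct_dead(ct, MODEL_DEAD_CRASH);
	model_kick_reset(ct);

	return 0;
}

/* Mirrors the two new case labels sharing one handler. */
static int model_process_g2h_msg(struct model_ct *ct, uint32_t action)
{
	int ret = 0;

	switch (action) {
	case MODEL_ACTION_CRASH_DUMP_POSTED:
	case MODEL_ACTION_EXCEPTION:
		ret = model_crash_process_msg(ct, action);
		break;
	default:
		fprintf(stderr, "unexpected G2H action 0x%04x\n",
			(unsigned int)action);
	}

	return ret;
}
```

Feeding either placeholder crash action to model_process_g2h_msg() on a
zeroed struct model_ct should leave dead_reasons with only the 0x8000 bit
set and reset_requested true; any other action leaves the state untouched.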


