[PATCH] drm/xe/guc_submit: improve schedule disable error logging

John Harrison john.c.harrison at intel.com
Fri Sep 27 23:05:48 UTC 2024


On 9/27/2024 06:35, Matthew Auld wrote:
> A few things here. Make the two prints consistent (and distinct), print
> the guc_id, and finally dump the CT queues. It should be possible to
> spot the guc_id in the CT queue dump and, for example, see that the
> host side has yet to process the response for the schedule disable, or
> that the GuC has yet to send it, to help narrow things down if we
> trigger the timeout.
Where are you seeing these failures? Is there an understanding of why? 
Or is this patch basically a "we have no idea what is going on, so get 
better logs out of CI" type thing? In which case, what you really want 
is to generate a devcoredump (with my debug improvements patch set, so 
that it includes the GuC log and such like) and to get CI to give you 
the core dumps back.
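
To illustrate, a rough sketch of that in the timedout-job path (this 
assumes the existing xe_devcoredump(job) entry point; the exact 
placement and details are hypothetical, not a finished change):

	if (!ret) {
		xe_gt_warn(gt, "Schedule disable failed to respond, guc_id=%d\n",
			   q->guc->id);
		/*
		 * Capture the GuC log and HW state in a devcoredump so CI
		 * can hand the dump back, rather than only printing the CT
		 * queues to dmesg.
		 */
		xe_devcoredump(job);
	}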

And maybe this is related to the fix from Badal: "drm/xe/guc: In 
guc_ct_send_recv flush g2h worker if g2h resp times out"? We have seen 
problems where the worker is simply not getting to run before the 
timeout expires.
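
That fix is essentially doing the following (paraphrasing Badal's patch; 
names are approximate):

	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
	if (!ret) {
		/*
		 * The response may already have arrived but the G2H worker
		 * hasn't had a chance to run yet; flush it and re-check
		 * before declaring a timeout.
		 */
		flush_work(&ct->g2h_worker);
		if (g2h_fence.done)
			ret = 1;
	}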

John.

>
> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638
> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
> Cc: Matthew Brost <matthew.brost at intel.com>
> Cc: Nirmoy Das <nirmoy.das at intel.com>
> ---
>   drivers/gpu/drm/xe/xe_guc_submit.c | 17 ++++++++++++++---
>   1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
> index 80062e1d3f66..52ed7c0043f9 100644
> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> @@ -977,7 +977,12 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>   					 !exec_queue_pending_disable(q) ||
>   					 guc_read_stopped(guc), HZ * 5);
>   		if (!ret) {
> -			drm_warn(&xe->drm, "Schedule disable failed to respond");
> +			struct xe_gt *gt = guc_to_gt(guc);
> +			struct drm_printer p = xe_gt_err_printer(gt);
> +
> +			xe_gt_warn(gt, "%s schedule disable failed to respond, guc_id=%d\n",
> +				   __func__, ge->id);
> +			xe_guc_ct_print(&guc->ct, &p, false);
>   			xe_sched_submission_start(sched);
>   			xe_gt_reset_async(q->gt);
>   			return;
> @@ -1177,8 +1182,14 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>   					 guc_read_stopped(guc), HZ * 5);
>   		if (!ret || guc_read_stopped(guc)) {
>   trigger_reset:
> -			if (!ret)
> -				xe_gt_warn(guc_to_gt(guc), "Schedule disable failed to respond");
> +			if (!ret) {
> +				struct xe_gt *gt = guc_to_gt(guc);
> +				struct drm_printer p = xe_gt_err_printer(gt);
> +
> +				xe_gt_warn(gt, "%s schedule disable failed to respond, guc_id=%d\n",
> +					   __func__, q->guc->id);
> +				xe_guc_ct_print(&guc->ct, &p, true);
> +			}
>   			set_exec_queue_extra_ref(q);
>   			xe_exec_queue_get(q);	/* GT reset owns this */
>   			set_exec_queue_banned(q);
