[PATCH] drm/xe/guc_submit: improve schedule disable error logging

Matthew Auld matthew.auld at intel.com
Mon Sep 30 10:00:55 UTC 2024


On 28/09/2024 00:05, John Harrison wrote:
> On 9/27/2024 06:35, Matthew Auld wrote:
>> A few things here. Make the two prints consistent (and distinct), print
>> the guc_id, and finally dump the CT queues. It should be possible to
>> spot the guc_id in the CT queue dump and, for example, see that the
>> host side has yet to process the response for the schedule disable, or
>> that the GuC has yet to send it, helping narrow things down if we
>> trigger the timeout.
> Where are you seeing these failures? Is there an understanding of why? 
> Or is this patch basically a "we have no idea what is going on, so get 
> better logs out of CI" type thing? In which case what you really want 
> is to generate a devcoredump (with my debug improvements patch set to 
> include the GuC log and such like) and to get CI to give you the core 
> dumps back.

Yeah, the patch is "we have no idea what is going on, so get better 
logs out of CI".

From https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638, one 
example failure: 
https://intel-gfx-ci.01.org/tree/intel-xe/xe-1873-c689a348137cb6f8934a9be49438bafe413b97d5/re-bmg-5/igt@xe_exec_fault_mode@many-execqueues-invalid-userptr-fault.html

devcoredump wired up to CI with everything thrown in sounds good.
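
As an aside, since both call sites end up duplicating the
warn-plus-CT-dump, one option could be to pull that into a small helper
so the two prints stay consistent by construction. A rough, untested
sketch, using only the names already visible in the diff below (the
helper name and the struct xe_exec_queue parameter type are my
assumptions):

	static void guc_warn_sched_disable_timeout(struct xe_guc *guc,
						   struct xe_exec_queue *q,
						   const char *func, bool atomic)
	{
		struct xe_gt *gt = guc_to_gt(guc);
		struct drm_printer p = xe_gt_err_printer(gt);

		/* Same message format at both call sites, then dump the CT queues */
		xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
			   func, q->guc->id);
		xe_guc_ct_print(&guc->ct, &p, atomic);
	}

The atomic flag would stay a parameter since the two hunks below pass
different values to xe_guc_ct_print() (false in the LR cleanup path,
true in the timedout-job path).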

> 
> And maybe this is related to the fix from Badal: "drm/xe/guc: In 
> guc_ct_send_recv flush g2h worker if g2h resp times out"? We have seen 
> problems where the worker is simply not getting to run before the 
> timeout expires.

I don't think the schedule disable is using the guc_ct_send_recv() 
interface, so I don't think it's related, but I'm not 100% sure.
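
For reference, the wait that times out here (see the context lines in
the hunks below) is a direct wait on the exec queue state rather than
on a g2h fence inside guc_ct_send_recv(). Roughly (the guc->ct.wq
waitqueue is my reading of xe_guc_submit.c):

	/* Waits on exec-queue state directly, not a g2h fence */
	ret = wait_event_timeout(guc->ct.wq,
				 !exec_queue_pending_disable(q) ||
				 guc_read_stopped(guc), HZ * 5);

so flushing the g2h worker on timeout, as in Badal's fix, wouldn't
obviously cover this path.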

> 
> John.
> 
>>
>> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638
>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>> Cc: Matthew Brost <matthew.brost at intel.com>
>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>> ---
>>   drivers/gpu/drm/xe/xe_guc_submit.c | 17 ++++++++++++++---
>>   1 file changed, 14 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>> index 80062e1d3f66..52ed7c0043f9 100644
>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>> @@ -977,7 +977,12 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>>                        !exec_queue_pending_disable(q) ||
>>                        guc_read_stopped(guc), HZ * 5);
>>           if (!ret) {
>> -            drm_warn(&xe->drm, "Schedule disable failed to respond");
>> +            struct xe_gt *gt = guc_to_gt(guc);
>> +            struct drm_printer p = xe_gt_err_printer(gt);
>> +
>> +            xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>> +                   __func__, q->guc->id);
>> +            xe_guc_ct_print(&guc->ct, &p, false);
>>               xe_sched_submission_start(sched);
>>               xe_gt_reset_async(q->gt);
>>               return;
>> @@ -1177,8 +1182,14 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>>                        guc_read_stopped(guc), HZ * 5);
>>           if (!ret || guc_read_stopped(guc)) {
>>   trigger_reset:
>> -            if (!ret)
>> -                xe_gt_warn(guc_to_gt(guc), "Schedule disable failed to respond");
>> +            if (!ret) {
>> +                struct xe_gt *gt = guc_to_gt(guc);
>> +                struct drm_printer p = xe_gt_err_printer(gt);
>> +
>> +                xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>> +                       __func__, q->guc->id);
>> +                xe_guc_ct_print(&guc->ct, &p, true);
>> +            }
>>               set_exec_queue_extra_ref(q);
>>               xe_exec_queue_get(q);    /* GT reset owns this */
>>               set_exec_queue_banned(q);
> 

