[PATCH] drm/xe/guc_submit: improve schedule disable error logging
John Harrison
john.c.harrison at intel.com
Mon Sep 30 22:48:07 UTC 2024
On 9/30/2024 03:00, Matthew Auld wrote:
> On 28/09/2024 00:05, John Harrison wrote:
>> On 9/27/2024 06:35, Matthew Auld wrote:
>>> A few things here. Make the two prints consistent (and distinct), print
>>> the guc_id, and finally dump the CT queues. It should then be possible
>>> to spot the guc_id in the CT queue dump and, for example, see that the
>>> host side has yet to process the response for the schedule disable, or
>>> that the GuC has yet to send it, which should help narrow things down
>>> if we trigger the timeout.
>> Where are you seeing these failures? Is there an understanding of
>> why? Or is this patch basically a "we have no idea what is going on,
>> so get better logs out of CI" type thing? In which case, what you
>> really want is to generate a devcoredump (with my debug improvements
>> patch set applied, so the GuC log and such like are included) and to
>> get CI to give you the core dumps back.
>
> Yeah, the patch is a "we have no idea what is going on, so get better
> logs out of CI" type thing.
>
> From https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638, one
> example failure:
> https://intel-gfx-ci.01.org/tree/intel-xe/xe-1873-c689a348137cb6f8934a9be49438bafe413b97d5/re-bmg-5/igt@xe_exec_fault_mode@many-execqueues-invalid-userptr-fault.html
>
> devcoredump wired up to CI with everything thrown in sounds good.
Just in general, it would probably be worth generating a devcoredump on
this failure anyway. You actually have access to a 'q' object at this
point, so just calling into the existing devcoredump code is trivial -
something along the lines of the sketch below. Although we really need
to get https://patchwork.freedesktop.org/series/134695/ reviewed and
merged for the dump to be particularly useful in this kind of 'GuC did
not respond' error.
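For example, something like this in guc_exec_queue_timedout_job()
(completely untested sketch - the exact devcoredump entry point and
arguments may well differ depending on which tree/series you are on):

	if (!ret) {
		struct xe_gt *gt = guc_to_gt(guc);

		xe_gt_warn(gt, "Schedule disable failed to respond guc_id=%d",
			   q->guc->id);
		/*
		 * Capture the state while it is still intact, before the GT
		 * reset below wipes it. Assumes the existing devcoredump
		 * entry point can simply be handed the job here.
		 */
		xe_devcoredump(job);
	}

The LR cleanup path has no job in scope, so that side would presumably
need a variant that works from just the queue.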
But if the 29% repro rate in the bug log can be believed, then it really
should be possible to repro this locally and get all the logs out, and
even to try a flush-work fix/hack to see if that is the problem.
>
>>
>> And maybe this is related to the fix from Badal: "drm/xe/guc: In
>> guc_ct_send_recv flush g2h worker if g2h resp times out"? We have
>> seen problems where the worker is simply not getting to run before
>> the timeout expires.
>
> I don't think the schedule disable is using the guc_ct_send_recv()
> interface, so I don't think it is related, but I'm not 100% sure.
That just means it won't benefit from the same fix (aka hack). It is
entirely possible it is still suffering from the worker thread not
running in a timely manner, but being a different code path it would
need its own explicit flush and retry before declaring the timeout.
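To illustrate, something roughly like this (untested sketch; assumes
the CT layer's G2H worker is reachable as guc->ct.g2h_worker, as per
Badal's patch):

	ret = wait_event_timeout(guc->ct.wq,
				 !exec_queue_pending_disable(q) ||
				 guc_read_stopped(guc), HZ * 5);
	if (!ret) {
		/*
		 * The G2H response may already be sitting in the queue with
		 * the worker simply not having had a chance to run yet, so
		 * flush it and wait once more before declaring a timeout.
		 */
		flush_work(&guc->ct.g2h_worker);
		ret = wait_event_timeout(guc->ct.wq,
					 !exec_queue_pending_disable(q) ||
					 guc_read_stopped(guc), HZ * 5);
	}
	if (!ret) {
		/* Still no response - warn / dump / reset as before. */
	}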
Although, as Matthew B says, if we are seeing the worker being delayed
for a second or more on a regular basis, then it suggests that something
is badly wrong somewhere. Linux is not a realtime OS, but that kind of
system burp should not be happening that frequently!
John.
>
>>
>> John.
>>
>>>
>>> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638
>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_guc_submit.c | 17 ++++++++++++++---
>>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> index 80062e1d3f66..52ed7c0043f9 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> @@ -977,7 +977,12 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>>>  				   !exec_queue_pending_disable(q) ||
>>>  				   guc_read_stopped(guc), HZ * 5);
>>>  	if (!ret) {
>>> -		drm_warn(&xe->drm, "Schedule disable failed to respond");
>>> +		struct xe_gt *gt = guc_to_gt(guc);
>>> +		struct drm_printer p = xe_gt_err_printer(gt);
>>> +
>>> +		xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>>> +			   __func__, ge->id);
>>> +		xe_guc_ct_print(&guc->ct, &p, false);
>>>  		xe_sched_submission_start(sched);
>>>  		xe_gt_reset_async(q->gt);
>>>  		return;
>>> @@ -1177,8 +1182,14 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>>>  				   guc_read_stopped(guc), HZ * 5);
>>>  	if (!ret || guc_read_stopped(guc)) {
>>>  trigger_reset:
>>> -		if (!ret)
>>> -			xe_gt_warn(guc_to_gt(guc), "Schedule disable failed to respond");
>>> +		if (!ret) {
>>> +			struct xe_gt *gt = guc_to_gt(guc);
>>> +			struct drm_printer p = xe_gt_err_printer(gt);
>>> +
>>> +			xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>>> +				   __func__, q->guc->id);
>>> +			xe_guc_ct_print(&guc->ct, &p, true);
>>> +		}
>>>  		set_exec_queue_extra_ref(q);
>>>  		xe_exec_queue_get(q); /* GT reset owns this */
>>>  		set_exec_queue_banned(q);
>>