[PATCH] drm/xe/guc_submit: improve schedule disable error logging
John Harrison
john.c.harrison at intel.com
Mon Sep 30 22:48:07 UTC 2024
On 9/30/2024 03:00, Matthew Auld wrote:
> On 28/09/2024 00:05, John Harrison wrote:
>> On 9/27/2024 06:35, Matthew Auld wrote:
>>> A few things here. Make the two prints consistent (and distinct), print
>>> the guc_id, and finally dump the CT queues. It should then be possible
>>> to spot the guc_id in the CT queue dump and, for example, see that the
>>> host side has yet to process the response for the schedule disable, or
>>> that the GuC has yet to send it, which should help narrow things down
>>> if we trigger the timeout.
>> Where are you seeing these failures? Is there an understanding of
>> why? Or is this patch basically a "we have no idea what is going on,
>> so get better logs out of CI" type thing? In which case, what you
>> really want is to generate a devcoredump (with my debug improvements
>> patch set applied, so the GuC log and such like are included) and to
>> get CI to give you the core dumps back.
>
> Yeah, the patch is a "we have no idea what is going on, so get better
> logs out of CI" type thing.
>
> From https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638, one
> example failure:
> https://intel-gfx-ci.01.org/tree/intel-xe/xe-1873-c689a348137cb6f8934a9be49438bafe413b97d5/re-bmg-5/igt@xe_exec_fault_mode@many-execqueues-invalid-userptr-fault.html
>
> devcoredump wired up to CI with everything thrown in sounds good.
Just in general, it would probably be worth generating a devcoredump on
this failure anyway. You actually have access to a 'q' object at this
point, so just calling into the existing devcoredump code is trivial -
something along the lines of the sketch below. Although we really need
to get https://patchwork.freedesktop.org/series/134695/ reviewed and
merged for the dump to be particularly useful in this kind of 'GuC did
not respond' error.
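For example, something like this in guc_exec_queue_timedout_job()
(completely untested sketch - the exact devcoredump entry point and
arguments may well differ depending on which tree/series you are on):

	if (!ret) {
		struct xe_gt *gt = guc_to_gt(guc);

		xe_gt_warn(gt, "Schedule disable failed to respond guc_id=%d",
			   q->guc->id);
		/*
		 * Capture the state while it is still intact, before the GT
		 * reset below wipes it. Assumes the existing devcoredump
		 * entry point can simply be handed the job here.
		 */
		xe_devcoredump(job);
	}

The LR cleanup path has no job in scope, so that side would presumably
need a variant that works from just the queue.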
But if the 29% repro rate in the bug log can be believed, then it really
should be possible to repro this locally and get all the logs out, and
even to try a flush-work fix/hack to see if that is the problem.
>
>>
>> And maybe this is related to the fix from Badal: "drm/xe/guc: In
>> guc_ct_send_recv flush g2h worker if g2h resp times out"? We have
>> seen problems where the worker is simply not getting to run before
>> the timeout expires.
>
> I don't think the schedule disable is using the guc_ct_send_recv()
> interface, so I don't think it is related, but I'm not 100% sure.
That just means it won't benefit from the same fix (aka hack). It is
entirely possible it is still suffering from the worker thread not
running in a timely manner, but being a different code path it would
need its own explicit flush and retry before declaring the timeout.
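To illustrate, something roughly like this (untested sketch; assumes
the CT layer's G2H worker is reachable as guc->ct.g2h_worker, as per
Badal's patch):

	ret = wait_event_timeout(guc->ct.wq,
				 !exec_queue_pending_disable(q) ||
				 guc_read_stopped(guc), HZ * 5);
	if (!ret) {
		/*
		 * The G2H response may already be sitting in the queue with
		 * the worker simply not having had a chance to run yet, so
		 * flush it and wait once more before declaring a timeout.
		 */
		flush_work(&guc->ct.g2h_worker);
		ret = wait_event_timeout(guc->ct.wq,
					 !exec_queue_pending_disable(q) ||
					 guc_read_stopped(guc), HZ * 5);
	}
	if (!ret) {
		/* Still no response - warn / dump / reset as before. */
	}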
Although, as Matthew B says, if we are seeing the worker being delayed
for a second or more on a regular basis, then it suggests that something
is badly wrong somewhere. Linux is not a realtime OS, but that kind of
system burp should not be happening that frequently!
John.
>
>>
>> John.
>>
>>>
>>> References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1638
>>> Signed-off-by: Matthew Auld <matthew.auld at intel.com>
>>> Cc: Matthew Brost <matthew.brost at intel.com>
>>> Cc: Nirmoy Das <nirmoy.das at intel.com>
>>> ---
>>> drivers/gpu/drm/xe/xe_guc_submit.c | 17 ++++++++++++++---
>>> 1 file changed, 14 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> index 80062e1d3f66..52ed7c0043f9 100644
>>> --- a/drivers/gpu/drm/xe/xe_guc_submit.c
>>> +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
>>> @@ -977,7 +977,12 @@ static void xe_guc_exec_queue_lr_cleanup(struct work_struct *w)
>>>  				   !exec_queue_pending_disable(q) ||
>>>  				   guc_read_stopped(guc), HZ * 5);
>>>  	if (!ret) {
>>> -		drm_warn(&xe->drm, "Schedule disable failed to respond");
>>> +		struct xe_gt *gt = guc_to_gt(guc);
>>> +		struct drm_printer p = xe_gt_err_printer(gt);
>>> +
>>> +		xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>>> +			   __func__, ge->id);
>>> +		xe_guc_ct_print(&guc->ct, &p, false);
>>>  		xe_sched_submission_start(sched);
>>>  		xe_gt_reset_async(q->gt);
>>>  		return;
>>> @@ -1177,8 +1182,14 @@ guc_exec_queue_timedout_job(struct drm_sched_job *drm_job)
>>>  				   guc_read_stopped(guc), HZ * 5);
>>>  	if (!ret || guc_read_stopped(guc)) {
>>>  trigger_reset:
>>> -		if (!ret)
>>> -			xe_gt_warn(guc_to_gt(guc), "Schedule disable failed to respond");
>>> +		if (!ret) {
>>> +			struct xe_gt *gt = guc_to_gt(guc);
>>> +			struct drm_printer p = xe_gt_err_printer(gt);
>>> +
>>> +			xe_gt_warn(gt, "%s schedule disable failed to respond guc_id=%d",
>>> +				   __func__, q->guc->id);
>>> +			xe_guc_ct_print(&guc->ct, &p, true);
>>> +		}
>>>  		set_exec_queue_extra_ref(q);
>>>  		xe_exec_queue_get(q); /* GT reset owns this */
>>>  		set_exec_queue_banned(q);
>>