[PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out
Matthew Brost
matthew.brost at intel.com
Wed Oct 2 14:04:43 UTC 2024
On Tue, Oct 01, 2024 at 01:41:15PM +0530, Nilawar, Badal wrote:
>
>
> On 28-09-2024 02:57, Matthew Brost wrote:
> > On Sat, Sep 28, 2024 at 12:54:28AM +0530, Badal Nilawar wrote:
> > > It is observed that for a GuC CT request the G2H IRQ is triggered and the
> > > g2h_worker is queued, but it doesn't get the opportunity to execute before
> > > the timeout occurs. To address this, the g2h_worker is flushed.
> > >
> > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
> > > ---
> > > drivers/gpu/drm/xe/xe_guc_ct.c | 11 +++++++++++
> > > 1 file changed, 11 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > index 4b95f75b1546..4a5d7f85d1a0 100644
> > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > @@ -903,6 +903,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
> > > }
> > > ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> > > +
> > > + /*
> > > + * It is observed that for the above GuC CT request the G2H IRQ triggered
> >
> > Where is this observed? One second is a long time to wait for a worker...
>
> Please see this log.
>
Logs are good, but explaining the test case is also helpful so I don't
have to reverse engineer things. Having platform information would be
helpful too. So what is the test case here, and what platform?
> [ 176.602482] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
> GuC PC status query
> [ 176.603019] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> GT[0]
> [ 176.603449] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> running GT[0]
> [ 176.604379] xe 0000:00:02.0: [drm:xe_guc_pc_get_max_freq [xe]] GT0: GT[0]
> GuC PC status query
> [ 176.605464] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> GT[0]
> [ 176.605821] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> running GT[0]
> [ 176.716699] xe 0000:00:02.0: [drm] GT0: trying reset
This looks like we are doing a GT reset and that is causing problems. This
patch is likely papering over an issue with our GT flows, so it doesn't
seem correct to me. Let's try to figure out what is going wrong in the
reset flow.
> [ 176.716718] xe 0000:00:02.0: [drm] GT0: GuC PC status query //GuC PC
> check request
> [ 176.717648] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> GT[0] // IRQ
> [ 177.728637] xe 0000:00:02.0: [drm] *ERROR* GT0: Timed out wait for G2H,
> fence 1311, action 3003 //Timeout
> [ 177.737637] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC PC query task state
> failed: -ETIME
> [ 177.745644] xe 0000:00:02.0: [drm] GT0: reset queued
This is almost 1 second after 'trying reset', and I'm unsure how that
could happen looking at the upstream source code. 'xe_uc_reset_prepare'
is called between 'trying reset' and 'reset queued', but that doesn't
wait anywhere; rather, it resolves to the function below:
int xe_guc_submit_reset_prepare(struct xe_guc *guc)
{
	int ret;

	/*
	 * Using an atomic here rather than submission_state.lock as this
	 * function can be called while holding the CT lock (engine reset
	 * failure). submission_state.lock needs the CT lock to resubmit jobs.
	 * Atomic is not ideal, but it works to prevent against concurrent reset
	 * and releasing any TDRs waiting on guc->submission_state.stopped.
	 */
	ret = atomic_fetch_or(1, &guc->submission_state.stopped);
	smp_wmb();
	wake_up_all(&guc->ct.wq);

	return ret;
}
Is this log from an internal repo or something? This looks like some
sort of circular dependency where a GT reset starts, the G2H handler
doesn't get queued because the CT channel is disabled, the G2H times
out, and the reset stalls waiting for the timeout.
> [ 177.849081] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
> GuC PC status query
> [ 177.849659] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> GT[0]
> [ 178.632672] xe 0000:00:02.0: [drm] GT0: reset started
> [ 178.632639] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> running GT[0] // Worker ran
> [ 178.632897] xe 0000:00:02.0: [drm] GT0: G2H fence (1311) not found!
>
> >
> > > + * and the g2h_worker was queued, but it didn't get the opportunity to
> > > + * execute before the timeout occurred. To address this, the g2h_worker is flushed.
> > > + */
> > > + if (!ret) {
> > > + flush_work(&ct->g2h_worker);
> > > + ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> >
> > If this is needed, I wouldn't wait 1 second; if the flush worked,
> > 'g2h_fence.done' should be signaled immediately. Maybe wait 1 ms?
>
> In the config HZ is set to 250, which I think is 4 ms.
>
A timeout of HZ jiffies should always be one second [1].
[1] https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4041820a-bbe4-4502-8ef9-d1913e133332.xhtml#:~:text=In%20other%20words%2C%20HZ%20represents,incremented%20HZ%20times%20every%20second.
> CONFIG_HZ_250=y
> # CONFIG_HZ_300 is not set
> # CONFIG_HZ_1000 is not set
> CONFIG_HZ=250
>
I'm a little confused about how this Kconfig works [2], but I don't think
it actually changes the length of time HZ represents; rather, it changes
how many jiffies there are in one second.
[2] https://lwn.net/Articles/56378/
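To make that concrete, here is a minimal sketch of the flush-then-short-wait
variant suggested above, with the extra wait expressed via msecs_to_jiffies()
so it reads in wall-clock time regardless of CONFIG_HZ. This is only an
illustration against the posted hunk, not a tested change:

	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);

	if (!ret) {
		/*
		 * The IRQ fired and the worker was queued but never ran;
		 * flush_work() runs it synchronously, so if that was the
		 * only problem the fence completes almost immediately and
		 * a very short follow-up wait is enough.
		 */
		flush_work(&ct->g2h_worker);
		ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done,
					 msecs_to_jiffies(1));
	}

Either way, wait_event_timeout() takes its timeout in jiffies, so passing HZ
means one second whether CONFIG_HZ is 250 or 1000.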
Matt
> Regards,
> Badal
>
> >
> > Matt
> >
> > > + }
> > > +
> > > if (!ret) {
> > > xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x",
> > > g2h_fence.seqno, action[0]);
> > > --
> > > 2.34.1
> > >
>