[PATCH] drm/xe/guc: In guc_ct_send_recv flush g2h worker if g2h resp times out
Matthew Brost
matthew.brost at intel.com
Fri Oct 4 06:50:15 UTC 2024
On Thu, Oct 03, 2024 at 03:24:12PM +0530, Nilawar, Badal wrote:
>
>
> On 02-10-2024 19:34, Matthew Brost wrote:
> > On Tue, Oct 01, 2024 at 01:41:15PM +0530, Nilawar, Badal wrote:
> > >
> > >
> > > On 28-09-2024 02:57, Matthew Brost wrote:
> > > > On Sat, Sep 28, 2024 at 12:54:28AM +0530, Badal Nilawar wrote:
> > > > > It is observed that for a GuC CT request the G2H IRQ is triggered and the
> > > > > g2h_worker is queued, but it doesn't get an opportunity to execute before
> > > > > the timeout occurs. To address this, the g2h_worker is flushed.
> > > > >
> > > > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > > > Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
> > > > > ---
> > > > > drivers/gpu/drm/xe/xe_guc_ct.c | 11 +++++++++++
> > > > > 1 file changed, 11 insertions(+)
> > > > >
> > > > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > > index 4b95f75b1546..4a5d7f85d1a0 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > > @@ -903,6 +903,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
> > > > > }
> > > > > ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> > > > > +
> > > > > + /*
> > > > > + * It is observed that for the above GuC CT request the G2H IRQ triggered
> > > >
> > > > Where is this observed? 1 second is a long time to wait for a worker...
> > >
> > > Please see this log.
> > >
> >
> > Logs are good but explaining the test case is also helpful so I don't
> > have to reverse engineer things. Having platform information would be
> > helpful too. So what is the test case here, and on what platform?
>
> Sorry, my bad, I should have added the issue id in the commit message.
> https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620.
>
> This issue is reported on LNL for the xe_gt_freq@freq_reset_multiple test and
> for the xe_pm@* tests during the resume flow.
>
> > > [ 176.602482] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
> > > GuC PC status query
> > > [ 176.603019] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0]
> > > [ 176.603449] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> > > running GT[0]
> > > [ 176.604379] xe 0000:00:02.0: [drm:xe_guc_pc_get_max_freq [xe]] GT0: GT[0]
> > > GuC PC status query
> > > [ 176.605464] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0]
> > > [ 176.605821] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> > > running GT[0]
> > > [ 176.716699] xe 0000:00:02.0: [drm] GT0: trying reset
> >
> > This looks like we are doing a GT reset and that is causing problems. This
> > patch is likely papering over an issue with our GT flows, so it doesn't
> > seem correct to me. Let's try to figure out what is going wrong in the
> > reset flow.
>
> This is seen for the SLPC query after "reset done" as well.
>
> > > [ 176.716718] xe 0000:00:02.0: [drm] GT0: GuC PC status query //GuC PC
> > > check request
> > > [ 176.717648] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0] // IRQ
> > > [ 177.728637] xe 0000:00:02.0: [drm] *ERROR* GT0: Timed out wait for G2H,
> > > fence 1311, action 3003 //Timeout
> > > [ 177.737637] xe 0000:00:02.0: [drm] *ERROR* GT0: GuC PC query task state
> > > failed: -ETIME
> > > [ 177.745644] xe 0000:00:02.0: [drm] GT0: reset queued
> >
> > Here this is almost 1 second after 'trying reset', and I'm unsure how
> > that could happen looking at the upstream source code.
> > 'xe_uc_reset_prepare' is called between 'trying reset' and 'reset
> > queued', but that doesn't wait anywhere; rather, it resolves to the
> > function below:
> >
> > int xe_guc_submit_reset_prepare(struct xe_guc *guc)
> > {
> >         int ret;
> >
> >         /*
> >          * Using an atomic here rather than submission_state.lock as this
> >          * function can be called while holding the CT lock (engine reset
> >          * failure). submission_state.lock needs the CT lock to resubmit jobs.
> >          * Atomic is not ideal, but it works to prevent against concurrent reset
> >          * and releasing any TDRs waiting on guc->submission_state.stopped.
> >          */
> >         ret = atomic_fetch_or(1, &guc->submission_state.stopped);
> >         smp_wmb();
> >         wake_up_all(&guc->ct.wq);
> >
> >         return ret;
> > }
>
> And the CT is not disabled yet, so the SLPC query will go through.
>
I agree the CT should not be disabled at this point.
> >
> > Is this log from an internal repo or something? This looks like some
> > sort of circular dependency where a GT reset starts, the G2H handler
> > doesn't get queued because the CT channel is disabled, the G2H times
> > out, and the reset stalls waiting for the timeout.
>
> This log is captured on LNL, with debug prints added, by running
> xe_gt_freq@freq_reset_multiple.
>
> If the CT channel were disabled then we would not see "G2H fence (1311) not
> found!".
>
> During the xe PM resume flow this is seen in guc_pc_start->pc_init_freqs().
>
Ok, I was spitballing ideas - if this is upstream then the CT should be
alive, but somehow it appears the worker that processes the CT is getting
stalled.
Also, the time gap between 'trying reset' and 'reset queued' is very
suspect.
This patch doesn't look like the solution. Can you look into the hints I've
given here?
Matt
>
> >
> > > [ 177.849081] xe 0000:00:02.0: [drm:xe_guc_pc_get_min_freq [xe]] GT0: GT[0]
> > > GuC PC status query
> > > [ 177.849659] xe 0000:00:02.0: [drm:xe_guc_irq_handler [xe]] GT0: G2H IRQ
> > > GT[0]
> > > [ 178.632672] xe 0000:00:02.0: [drm] GT0: reset started
> > > [ 178.632639] xe 0000:00:02.0: [drm:g2h_worker_func [xe]] GT0: G2H work
> > > running GT[0] // Worker ran
> > > [ 178.632897] xe 0000:00:02.0: [drm] GT0: G2H fence (1311) not found!
> > >
> > > >
> > > > > + * and the g2h_worker was queued, but it didn't get an opportunity to execute
> > > > > + * before the timeout occurred. To address this, the g2h_worker is flushed.
> > > > > + */
> > > > > + if (!ret) {
> > > > > + flush_work(&ct->g2h_worker);
> > > > > + ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> > > >
> > > > If this is needed I wouldn't wait 1 second; if the flush worked,
> > > > 'g2h_fence.done' should be signaled immediately. Maybe wait 1 ms?
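
(To expand on the above with a rough sketch - this just reuses the
identifiers already in this patch and is only meant to illustrate the
shorter re-wait, not something I'm proposing to land as-is. flush_work()
blocks until the queued worker has finished, so by the time it returns the
fence should already be signaled:

        ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
        if (!ret) {
                /*
                 * The G2H worker was queued but starved; flush_work() waits
                 * for it to run, so once it returns the fence should already
                 * be signaled and the re-wait is only a short safety net.
                 */
                flush_work(&ct->g2h_worker);
                ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done,
                                         msecs_to_jiffies(1));
        }
)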
> > >
> > > In the config HZ is set to 250, which is 4 ms I think.
> > >
> >
> > A timeout of HZ jiffies should always be one second [1].
> >
> > [1] https://www.oreilly.com/library/view/linux-device-drivers/9781785280009/4041820a-bbe4-4502-8ef9-d1913e133332.xhtml#:~:text=In%20other%20words%2C%20HZ%20represents,incremented%20HZ%20times%20every%20second.
> >
> > > CONFIG_HZ_250=y
> > > # CONFIG_HZ_300 is not set
> > > # CONFIG_HZ_1000 is not set
> > > CONFIG_HZ=250
> > >
> >
> > I'm a little confused about how this Kconfig works [2], but I don't think
> > it actually changes the amount of time HZ represents; rather, it changes
> > how many jiffies are in one second.
> >
> > [2] https://lwn.net/Articles/56378/
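
(Concretely: CONFIG_HZ=250 means 250 timer ticks per second, so one jiffy
is 1000 ms / 250 = 4 ms, and a timeout argument of HZ jiffies is still
250 * 4 ms = 1000 ms = 1 second. That's why the wait in guc_ct_send_recv()
is ~1 second regardless of the Kconfig choice; a millisecond-scale wait
would need an explicit conversion such as msecs_to_jiffies(1).)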
>
> Oh ok, thanks for the clarification.
>
> Regards,
> Badal
>
> >
> > Matt
> >
> > > Regards,
> > > Badal
> > >
> > > >
> > > > Matt
> > > >
> > > > > + }
> > > > > +
> > > > > if (!ret) {
> > > > > xe_gt_err(gt, "Timed out wait for G2H, fence %u, action %04x",
> > > > > g2h_fence.seqno, action[0]);
> > > > > --
> > > > > 2.34.1
> > > > >
> > >
>