[PATCH 2/3] drm/xe/guc/ct: Increase wait timeout for g2h response

Thu Oct 17 09:54:13 UTC 2024

On 2024-10-14 at 17:42:29 +0530, Nilawar, Badal wrote:
> Hi Matt, John,
> 
> Thanks for review comments.
> 
> On 11-10-2024 04:36, Matthew Brost wrote:
> > On Wed, Oct 09, 2024 at 12:43:36PM -0700, John Harrison wrote:
> > > On 10/9/2024 03:56, Badal Nilawar wrote:
> > > > Occasionally, the G2H worker starts running after a delay of more than
> > > > a second even after being queued and activated by the Linux workqueue
> > > > subsystem.
> > > > To prevent G2H timeout errors, the wait timeout is being increased.
> > > > 
> > > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620
> > > > Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902
> > > > Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
> > > > Cc: Matthew Brost <matthew.brost at intel.com>
> > > > Cc: Matthew Auld <matthew.auld at intel.com>
> > > > Cc: John Harrison <John.C.Harrison at Intel.com>
> > > > ---
> > > >    drivers/gpu/drm/xe/xe_guc_ct.c | 2 +-
> > > >    1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > index b93b2821e4e8..dcc95c01b6f0 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > @@ -1019,7 +1019,7 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
> > > >    		return ret;
> > > >    	}
> > > > -	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
> > > > +	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ * 3);
> > > Is this change intended to be temporary until the fundamental scheduling
> > > issue with the workqueue is fixed? If so, there should be a TODO comment to
> > > that effect so that we remember to shrink the timeout back down again later.
> > > Three seconds seems like a long time to wait.
> > > 
> > 
> > I'm fine with this W/A until we root cause the work queue scheduling issue
> > but agree this needs a comment explaining why this large timeout is
> > needed (work queue scheduling issue), how to trigger the larger timeout
> > (tests which can trigger this), and saying once we root cause issue
> > reduce the timeout.
The root cause of this issue lies in the scheduling latency of the LNL hybrid CPU.
This is beyond XeKMD, and the issue disappears if we disable the LNL Atom CPUs in the BIOS.
I agree to add a TODO comment here to remove this WA once the workqueue
scheduling issue on the LNL hybrid CPU is fixed.
But why do we need to explain the tests and reproduction steps in the code comment?
If needed, that can be added to the cover letter.
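
To make the timeout reasoning concrete, here is a rough userspace sketch, not
driver code. It only models the return value of wait_event_timeout() for a
worker that starts late. The HZ value, the G2H_WAIT_TIMEOUT macro, the
model_wait_event_timeout() helper, and the TODO wording are all assumptions
made for illustration; none of them exist in xe_guc_ct.c.

```c
/* Assumption: a userspace stand-in for the kernel's HZ (ticks per second). */
#define HZ 1000L

/*
 * TODO: reduce back to HZ once the workqueue scheduling latency seen on
 * LNL hybrid CPUs is fixed. (Hypothetical wording and macro name.)
 */
#define G2H_WAIT_TIMEOUT (HZ * 3)

/*
 * Hypothetical model of what wait_event_timeout() returns when the G2H
 * worker marks the fence done after `worker_delay` ticks:
 *   0    -> timed out with the condition still false
 *   >= 1 -> condition became true; ticks remaining (at least 1)
 */
static long model_wait_event_timeout(long worker_delay, long timeout)
{
	long remaining = timeout - worker_delay;

	if (remaining < 0)
		return 0;	/* worker ran too late: caller sees a G2H timeout */

	return remaining > 0 ? remaining : 1;
}
```

In this model, a worker delayed by 1.5 seconds (1500 ticks) times out against
HZ but completes against G2H_WAIT_TIMEOUT, which is the case the W/A is meant
to cover.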

Thanks,
Anshuman

> 
> Sure, I will add the comment here and in patch 3 to explain why this is
> needed and that the change needs to be reverted once this is fixed.
> 
> Regards,
> Badal
> 
> > 
> > Matt
> > 
> > > John.
> > > 
> > > >    	/*
> > > >    	 * It is possible that the g2h request may be cancelled while waiting for a response due
> > > 
>