[PATCH v2 1/2] drm/xe/guc/ct: Increase wait timeout for G2H response

Wed Oct 16 11:52:55 UTC 2024

Occasionally the G2H worker starts running more than a second after it
has been queued and activated by the Linux workqueue subsystem. This
exceeds the current one-second (HZ) wait and results in a G2H timeout
error.
To prevent these errors, increase the wait timeout from HZ to 3 * HZ.

v2: Add comment to describe this change with TODO (Matt B/John H)

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/1620
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/issues/2902
Signed-off-by: Badal Nilawar <badal.nilawar at intel.com>
Cc: Matthew Brost <matthew.brost at intel.com>
Cc: Matthew Auld <matthew.auld at intel.com>
Cc: John Harrison <John.C.Harrison at Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
---
 drivers/gpu/drm/xe/xe_guc_ct.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)
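
Note for readers: the effect of the longer timeout can be illustrated with a
small userspace model. This is only a sketch and not driver code. SKETCH_HZ
and model_wait_event_timeout() are hypothetical names introduced here, and
the tick rate of 1000 is an assumption (the kernel's HZ depends on
CONFIG_HZ). The model mirrors the documented wait_event_timeout() return
convention: 0 when the timeout elapses first, otherwise the remaining
jiffies (at least 1).

```c
#include <assert.h>

/* Hypothetical tick rate for this sketch; the kernel's HZ depends on CONFIG_HZ. */
#define SKETCH_HZ 1000L

/*
 * Userspace model (not kernel code) of the wait_event_timeout() return value
 * when the awaited condition becomes true completion_delay jiffies after the
 * wait starts:
 *   0     the timeout elapsed before the condition became true
 *   >= 1  the remaining jiffies when the condition became true in time
 *
 * Simplification: a completion landing exactly on the deadline is treated
 * as a timeout here.
 */
long model_wait_event_timeout(long completion_delay, long timeout)
{
	assert(completion_delay >= 0 && timeout > 0);

	if (completion_delay >= timeout)
		return 0;

	return timeout - completion_delay;
}
```

With a worker that starts 1.2 seconds late (1200 jiffies under the assumed
tick rate), the model returns 0 for a timeout of SKETCH_HZ, which corresponds
to the reported G2H timeout, and 1800 for 3 * SKETCH_HZ, which corresponds to
a successful wait.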

diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index c7673f56d413..3096baa4c9f4 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1016,7 +1016,17 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,
 		return ret;
 	}
 
-	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);
+	/*
+	 * Occasionally the G2H worker starts running more than a second after being queued
+	 * and activated by the Linux workqueue subsystem, which leads to a G2H timeout error.
+	 * This is seen especially while running xe_pm and the GT reset flow, which uses
+	 * xe_guc_ct_send_recv(). To prevent G2H timeout errors, the wait timeout is
+	 * increased from HZ to 3 * HZ.
+	 *
+	 * TODO: Reduce the timeout once the workqueue scheduling delay is root-caused and fixed.
+	 */
+
+	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ * 3);
 
 	/*
 	 * Ensure we serialize with completion side to prevent UAF with fence going out of scope on
-- 
2.34.1