[PATCH] drm/xe: Don't short circuit TDR on jobs not started

Matthew Brost matthew.brost at intel.com
Thu Oct 24 02:51:26 UTC 2024


Short-circuiting the TDR on jobs that have not started is an optimization
which is not required. On LNL we are facing an issue where jobs do not
get scheduled by the GuC if it misses a GGTT page update. When this
occurs, let the TDR fire, toggle scheduling (which may get the job
unstuck), and print a warning message. If the TDR fires twice on a job
that hasn't started, time out the job.

v2:
 - Add warning message (Paulo)
 - Add fixes tag (Paulo)
 - Time out a job which hasn't started after the TDR fires twice

Fixes: 7ddb9403dd74 ("drm/xe: Sample ctx timestamp to determine if jobs have timed out")
Cc: stable at vger.kernel.org
Cc: Paulo Zanoni <paulo.r.zanoni at intel.com>
Signed-off-by: Matthew Brost <matthew.brost at intel.com>
---
 drivers/gpu/drm/xe/xe_guc_submit.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)
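
Note: the "fires twice, then time out" behaviour described in the commit
message can be illustrated with a small userspace model. This is only a
sketch under stated assumptions, not the driver code: struct model_job,
its tdr_strikes counter, and model_check_timeout() are hypothetical names
invented for illustration, and the exact threshold semantics of
xe_sched_invalidate_job() are not shown in this patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a scheduler job; not a real xe structure. */
struct model_job {
	bool started;                       /* has the job begun executing? */
	unsigned int tdr_strikes;           /* TDR firings seen while not started */
	unsigned long long running_time_ms; /* measured run time once started */
};

/*
 * Model of the decision made on each TDR firing.
 * Returns true when the job should be treated as timed out.
 */
static bool model_check_timeout(struct model_job *job, unsigned int timeout_ms)
{
	if (!job->started) {
		/*
		 * Assumption: the first firing only records a strike, giving
		 * the scheduling toggle a chance to unstick the job. The
		 * second firing declares the job timed out.
		 */
		job->tdr_strikes++;
		return job->tdr_strikes >= 2;
	}

	/* A started job is judged purely on its measured run time. */
	return job->running_time_ms >= timeout_ms;
}
```

In this model, a job that never starts survives the first TDR firing and
is timed out on the second, while a started job is unaffected by the
strike counter.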

diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index e5d7c767a744..6182d86a234c 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -941,12 +941,19 @@ static bool check_timeout(struct xe_exec_queue *q, struct xe_sched_job *job)
 	running_time_ms =
 		ADJUST_FIVE_PERCENT(xe_gt_clock_interval_to_ms(gt, diff));
 
-	xe_gt_dbg(gt,
-		  "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, running_time_ms=%llu, timeout_ms=%u, diff=0x%08x",
-		  xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
-		  q->guc->id, running_time_ms, timeout_ms, diff);
-
-	return running_time_ms >= timeout_ms;
+	if (!xe_sched_job_started(job)) {
+		xe_gt_notice(gt,
+			     "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, not started",
+			     xe_sched_job_seqno(job),
+			     xe_sched_job_lrc_seqno(job), q->guc->id);
+		return xe_sched_invalidate_job(job, 2);
+	} else {
+		xe_gt_dbg(gt,
+			  "Check job timeout: seqno=%u, lrc_seqno=%u, guc_id=%d, running_time_ms=%llu, timeout_ms=%u, diff=0x%08x",
+			  xe_sched_job_seqno(job), xe_sched_job_lrc_seqno(job),
+			  q->guc->id, running_time_ms, timeout_ms, diff);
+		return running_time_ms >= timeout_ms;
+	}
 }
 
 static void enable_scheduling(struct xe_exec_queue *q)
-- 
2.34.1


