[RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it

Thu Sep 28 15:19:58 UTC 2023

While discussing with Nirmoy offline about this other way for
fixing lock contention, he was a bit sceptical about it.

But why not? We know that if we fall into this case it's because
some hardware component has forgotten to release the lock within
100ms. So that we have two possibilities, either bail out or
force the unlock.

Forcing the unlock might not be respectful to the environment,
but, at the end, i915 should have the highest priority.

Nirmoy's solution here[*] is to force the unlock during gt
resume, but what happens if meantime the hardware takes the lock
and doesn't release it?

Open for opinions or profligate rejections :-)

I'm also curious to see what CI has to say about.

[*] https://patchwork.freedesktop.org/series/124397/

Signed-off-by: Andi Shyti <andi.shyti at linux.intel.com>
Cc: Nirmoy Das <nirmoy.das at intel.com>
Cc: Matt Roper <matthew.d.roper at intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 46 ++++++++++++++++----------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index bf4a933de03a..38833515b220 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -371,14 +371,34 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
 
 	lockdep_assert_not_held(&gt->uncore->lock);
 
-	/*
-	 * Starting with MTL, we need to coordinate not only with other
-	 * driver threads, but also with hardware/firmware agents.  A dedicated
-	 * locking register is used.
-	 */
-	if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
-		err = wait_for(intel_uncore_read_fw(gt->uncore,
-						    MTL_STEER_SEMAPHORE) == 0x1, 100);
+	do {
+		/*
+		 * Starting with MTL, we need to coordinate not only with other
+		 * driver threads, but also with hardware/firmware agents.  A
+		 * dedicated locking register is used.
+		 */
+		if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+			err = wait_for(intel_uncore_read_fw(gt->uncore,
+					      MTL_STEER_SEMAPHORE) == 0x1, 100);
+		else
+			break;
+
+		/*
+		 * In theory we should never fail to acquire the HW semaphore;
+		 * this would indicate some hardware/firmware is misbehaving and
+		 * not releasing it properly.
+		 */
+		if (err == -ETIMEDOUT) {
+			gt_warn(gt,
+				"hardware MCR steering semaphore timed out "
+				"forcing lock takeover\n");
+			/*
+			 * Force lock takeover
+			 */
+			intel_uncore_write_fw(gt->uncore,
+					      MTL_STEER_SEMAPHORE, 0x1);
+		}
+	} while (err == -ETIMEDOUT);
 
 	/*
 	 * Even on platforms with a hardware lock, we'll continue to grab
@@ -389,16 +409,6 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
 	spin_lock_irqsave(&gt->mcr_lock, __flags);
 
 	*flags = __flags;
-
-	/*
-	 * In theory we should never fail to acquire the HW semaphore; this
-	 * would indicate some hardware/firmware is misbehaving and not
-	 * releasing it properly.
-	 */
-	if (err == -ETIMEDOUT) {
-		gt_err_ratelimited(gt, "hardware MCR steering semaphore timed out");
-		add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now unreliable */
-	}
 }
 
 /**
-- 
2.40.1