[Intel-gfx] [PATCH] drm/i915/tgl: Magic udelay to relieve the random lockups with multiple engines

Sat Sep 28 10:01:45 UTC 2019

My current theory is that interrupt delivery to the local CPU is masked
during a critical phase. This patch purely papers over the symptoms with
a delay plucked out of thin air from testing on tgl1-gem.

Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala at linux.intel.com>
Cc: Andi Shyti <andi.shyti at intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index fa385218ce92..fe8f4625f04f 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1186,6 +1186,21 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
 	/* we need to manually load the submit queue */
 	if (execlists->ctrl_reg)
 		writel(EL_CTRL_LOAD, execlists->ctrl_reg);
+
+	/*
+	 * Now this is evil magic.
+	 *
+	 * Adding the same udelay() to process_csb before we clear
+	 * execlists->pending (that is after we receive the HW ack for this
+	 * submit and before we can submit again) does not relieve the symptoms
+	 * (machine lockup). So is the active difference here the wait under
+	 * the irq-off spinlock? That gives more credence to the theory that
+	 * the issue is interrupt delivery. Also note that we still rely on
+	 * disabling RPS; again, that seems like an issue with simultaneous
+	 * GT interrupts being delivered to the same CPU.
+	 */
+	if (IS_TIGERLAKE(engine->i915))
+		udelay(250);
 }
 
 static bool ctx_single_port_submission(const struct intel_context *ce)
-- 
2.23.0