[Intel-gfx] [PATCH 19/20] drm/i915: drm/i915 changes to simulated hangs

Thu Oct 22 18:32:41 PDT 2015

From: Tim Gore <tim.gore at intel.com>

Simulated hangs, as used by drv_hangman and some other IGT tests, are not
handled correctly with the new per-engine hang recovery mode. This patch fixes
several issues needed to get them working in the execlist case.

1) The "simulated" hang is effected by not submitting a particular batch buffer
   to the hardware. In this way it is not handled by the hardware are hence
   remains in the software queue, leading the TDR mechanism to declare a hang.
   The place where the submission of the batch was being blocked was in
   intel_logical_ring_advance_and_submit. Because this means the request never
   enters the execlist_queue, the TDR mechanism does not detect a hang and the
   situation is never cleared. Also, blocking the batch buffer here is before
   the intel_ctx_submit_request object gets allocated. During TDR we need to
   actually complete the submission process to unhang the ring, but we are not
   in user context so cannot allocate the request object.  To overcome both
   these issues I moved the place where submission is blocked to
   execlists_context_unqueue. This means that the request enters the
   ring->execlist_queue, so the TDR mechanism detects the hang and can resubmit
   the request after the stop_rings bit is cleared.

2) A further problem arises from a workaround in i915_hangcheck_sample to deal
   with a context submission status of "...INCONSISTENT" being reported by
   intel_execlists_TDR_get_current_request() when the hardware is idle.  A
   simulated hang, because it causes the sw and hw context id's to be out of
   sync, results in a context submission status of "...INCONSISTENT" being
   reported, triggering this workaround which resubmits the batch and clears
   the hang, avoiding a ring reset. But we want the ring reset to occur, since
   this is part of what we are testing. So, I have made
   intel_execlists_TDR_get_current_request() aware of simulated hangs, so that
   it returns a status of OK in this case.  This avoids the workaround being
   triggered, leading to the TDR mechanism declaring a ring hang and doing a
   ring reset.

Issue: VIZ-5488
Signed-off-by: Tim Gore <tim.gore at intel.com>
Signed-off-by: Tomas Elf <tomas.elf at intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c  | 10 ++++++++++
 drivers/gpu/drm/i915/intel_lrc.c | 26 +++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d482341..ecb3eb1 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1175,6 +1175,16 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 		 }
 	}
 
+	/* Clear any simulated hang flags */
+	if (dev_priv->gpu_error.stop_rings) {
+		DRM_INFO("Simulated gpu hang, reset stop_rings bits %08x\n",
+			(0x1 << engine->id));
+		dev_priv->gpu_error.stop_rings &= ~(0x1 << engine->id);
+		/* if all hangs are cleared, then clear the ALLOW_BAN/ERROR bits */
+		if ((dev_priv->gpu_error.stop_rings & ((1 << I915_NUM_RINGS) - 1)) == 0)
+			dev_priv->gpu_error.stop_rings = 0;
+	}
+
 	/* Sample the current ring head position */
 	head = I915_READ_HEAD(engine) & HEAD_ADDR;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 96f5a28..284baab 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -641,6 +641,16 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring, bool tdr_res
 		}
 	}
 
+	/* Check for a simulated hang request */
+	if (intel_ring_stopped(ring)) {
+		/*
+		 * Mark the request at the head of the queue as submitted but
+		 * dont actually submit it.
+		 */
+		req0->elsp_submitted++;
+		return;
+	}
+
 	WARN_ON(req1 && req1->elsp_submitted && !tdr_resubmission);
 
 	execlists_submit_requests(req0, req1, tdr_resubmission);
@@ -1006,16 +1016,12 @@ static int logical_ring_wait_for_space(struct drm_i915_gem_request *req,
 static void
 intel_logical_ring_advance_and_submit(struct drm_i915_gem_request *request)
 {
-	struct intel_engine_cs *ring = request->ring;
 	struct drm_i915_private *dev_priv = request->i915;
 
 	intel_logical_ring_advance(request->ringbuf);
 
 	request->tail = request->ringbuf->tail;
 
-	if (intel_ring_stopped(ring))
-		return;
-
 	if (dev_priv->guc.execbuf_client)
 		i915_guc_submit(dev_priv->guc.execbuf_client, request);
 	else
@@ -3250,7 +3256,17 @@ intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
 	}
 
 	if (tmpctx) {
-		status = ((hw_context == sw_context) && hw_active) ?
+		/*
+		 * Check for simuated hang. In this case the head entry in the
+		 * sw execlist queue will not have been submitted to the ELSP, so
+		 * the hw and sw context id's may well disagree, but we still want
+		 * to proceed with hang recovery. So we return OK which allows
+		 * the TDR recovery mechanism to proceed with a ring reset.
+		 */
+		if (intel_ring_stopped(ring))
+			status = CONTEXT_SUBMISSION_STATUS_OK;
+		else
+			status = ((hw_context == sw_context) && hw_active) ?
 				CONTEXT_SUBMISSION_STATUS_OK :
 				CONTEXT_SUBMISSION_STATUS_INCONSISTENT;
 	} else {
-- 
1.9.1