[Intel-gfx] [PATCH v7] drm/i915/execlists: Move WA_TAIL_DWORDS to callee

Dave Gordon david.s.gordon at intel.com
Fri Aug 19 13:31:15 UTC 2016


Currently the execlist-specific emit-request functions start writing
to the ring and also reserve space for a workaround to be emitted
later, while submitting the request. The code is easier to read if
each caller allocates only the space for its own accesses (the reader
can then quickly verify that intel_ring_begin() requests exactly the
number of dwords emitted) and closes its access to the ring. During
submission, if we need to add the workaround, we can reacquire ring
access, confident that we reserved space for it when beginning the
request.

v3:
    Reinstated #define for WA_TAIL_DWORDS so we could accommodate GPUs
    that require different amounts of padding, generalised the NOOP
    fill [Rodrigo Vivi]; added W/A space to the reserved amount and
    updated code comments [Dave Gordon].

v4:
    Made #define for WA_TAIL_DWORDS a function-type macro in case we
    want different values on different platforms (or engines), to
    address comment by [Rodrigo Vivi]. But the current value is still
    the constant (2) on all platforms, so this doesn't add any overhead.

v5:
    Eliminated local variable & multiple indirections. Added long essay
    comment about WaIdleLiteRestore.

v6:
    Renamed the #define for the amount of space used by add_request() -
    it's the *maximum* length of the opcodes emitted by that function,
    not a minimum - and made it (hypothetically) platform- or
    engine-dependent. But it's still the same literal constant on all
    platforms for now.

v7:
    Rebased

Originally-by: Rodrigo Vivi <rodrigo.vivi at gmail.com>
Rewritten-by: Chris Wilson <chris at chris-wilson.co.uk>
Further-tweaked-by: Dave Gordon <david.s.gordon at intel.com>

Signed-off-by: Dave Gordon <david.s.gordon at intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi at gmail.com>
Cc: Chris Wilson <chris at chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 92 ++++++++++++++++++++++++---------
 drivers/gpu/drm/i915/intel_ringbuffer.h | 19 ++++---
 3 files changed, 81 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 1a21532..f5b38dd 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -411,7 +411,7 @@ struct drm_i915_gem_request *
 	 * to be redone if the request is not actually submitted straight
 	 * away, e.g. because a GPU scheduler has deferred it.
 	 */
-	req->reserved_space = MIN_SPACE_FOR_ADD_REQUEST;
+	req->reserved_space = sizeof(u32) * MAX_ADD_REQUEST_DWORDS(req);
 
 	if (i915.enable_execlists)
 		ret = intel_logical_ring_alloc_request_extras(req);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 6b49df4..60919cf 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -218,6 +218,17 @@ enum {
 #define GEN8_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT	0x17
 #define GEN9_CTX_RCS_INDIRECT_CTX_OFFSET_DEFAULT	0x26
 
+/*
+ * Reserve space for some NOOPs at the end of each request, to be used by
+ * a workaround for not being allowed to do lite restore with HEAD==TAIL
+ * (WaIdleLiteRestore).
+ *
+ * The number of NOOPs is the same constant on all current platforms that
+ * require this, but in theory could be a platform- or engine- specific
+ * value based on the request.
+ */
+#define WA_TAIL_DWORDS(request)	(2)
+
 /* Typical size of the average request (2 pipecontrols and a MI_BB) */
 #define EXECLISTS_REQUEST_SIZE 64 /* bytes */
 
@@ -475,12 +486,27 @@ static void execlists_unqueue(struct intel_engine_cs *engine)
 
 	if (req0->elsp_submitted & engine->idle_lite_restore_wa) {
 		/*
-		 * WaIdleLiteRestore: make sure we never cause a lite restore
-		 * with HEAD==TAIL.
+		 * WaIdleLiteRestore: lite restore must not have HEAD==TAIL.
+		 *
+		 * If a request has previously been submitted (as req1) and
+		 * is now being /re/submitted (as req0), it may actually have
+		 * completed (with HEAD==TAIL), but we don't know that yet.
 		 *
-		 * Apply the wa NOOPS to prevent ring:HEAD == req:TAIL as we
-		 * resubmit the request. See gen8_emit_request() for where we
-		 * prepare the padding after the end of the request.
+		 * Unfortunately the hardware requires that we not submit
+		 * a context that is already idle with HEAD==TAIL; but we
+		 * cannot safely check this because there would always be
+		 * an opportunity for a race, where the context /becomes/
+		 * idle after we check and before resubmission.
+		 *
+		 * So instead we increment the request TAIL here to ensure
+		 * that it is different from the last value seen by the
+		 * hardware, in effect always adding extra work to be done
+		 * even if the context has already completed. That work
+		 * consists of NOOPs added by intel_logical_ring_advance()
+		 * after the end of each request. Advancing TAIL turns
+		 * those NOOPs into part of the current request; if not so
+		 * consumed, they remain in the ringbuffer as a harmless
+		 * prefix to the next request.
 		 */
 		req0->tail += 8;
 		req0->tail &= req0->ring->size - 1;
@@ -654,6 +680,14 @@ int intel_logical_ring_alloc_request_extras(struct drm_i915_gem_request *request
 	 */
 	request->reserved_space += EXECLISTS_REQUEST_SIZE;
 
+	/*
+	 * WA_TAIL_DWORDS is specific to the execlist submission mechanism,
+	 * to accommodate some NOOPs at the end of each request, to be used
+	 * by a workaround for not being allowed to do lite restore with
+	 * HEAD==TAIL (WaIdleLiteRestore). See intel_logical_ring_advance().
+	 */
+	request->reserved_space += sizeof(u32) * WA_TAIL_DWORDS(request);
+
 	if (!ce->state) {
 		ret = execlists_context_deferred_alloc(request->ctx, engine);
 		if (ret)
@@ -708,29 +742,44 @@ int intel_logical_ring_alloc_request_extras(struct drm_i915_gem_request *request
  * intel_logical_ring_advance() - advance the tail and prepare for submission
  * @request: Request to advance the logical ringbuffer of.
  *
- * The tail is updated in our logical ringbuffer struct, not in the actual context. What
- * really happens during submission is that the context and current tail will be placed
- * on a queue waiting for the ELSP to be ready to accept a new context submission. At that
- * point, the tail *inside* the context is updated and the ELSP written to.
+ * The tail is updated in our logical ringbuffer struct, not in the actual
+ * context. What really happens during submission is that the context and
+ * current tail will be placed on a queue waiting for the ELSP to be ready
+ * to accept a new context submission. At that point, the tail *inside* the
+ * context is updated and the ELSP written to by the submitting agent i.e.
+ * either the driver (in execlist mode), or the GuC (in GuC-submission mode).
  */
 static int
 intel_logical_ring_advance(struct drm_i915_gem_request *request)
 {
 	struct intel_ring *ring = request->ring;
 	struct intel_engine_cs *engine = request->engine;
+	int padding = WA_TAIL_DWORDS(request);
 
 	intel_ring_advance(ring);
 	request->tail = ring->tail;
 
 	/*
-	 * Here we add two extra NOOPs as padding to avoid
-	 * lite restore of a context with HEAD==TAIL.
-	 *
-	 * Caller must reserve WA_TAIL_DWORDS for us!
+	 * Fill in a few NOOPs after the end of the request proper,
+	 * as a buffer between requests to be used by a workaround
+	 * for not being allowed to do lite restore with HEAD==TAIL
+	 * (WaIdleLiteRestore). These words may be consumed by the
+	 * submission mechanism if a context is *re*submitted while
+	 * (potentially) still active; otherwise, they will be left
+	 * as a harmless preamble to the next request.
 	 */
-	intel_ring_emit(ring, MI_NOOP);
-	intel_ring_emit(ring, MI_NOOP);
-	intel_ring_advance(ring);
+	if (padding > 0) {
+		/* Safe because we reserved the space earlier */
+		int ret = intel_ring_begin(request, padding);
+		if (WARN_ON(ret != 0))
+			return ret;
+
+		do
+			intel_ring_emit(ring, MI_NOOP);
+		while (--padding);
+
+		intel_ring_advance(ring);
+	}
 
 	/* We keep the previous context alive until we retire the following
 	 * request. This ensures that any the context object is still pinned
@@ -1587,19 +1636,12 @@ static void bxt_a_seqno_barrier(struct intel_engine_cs *engine)
 	intel_flush_status_page(engine, I915_GEM_HWS_INDEX);
 }
 
-/*
- * Reserve space for 2 NOOPs at the end of each request to be
- * used as a workaround for not being allowed to do lite
- * restore with HEAD==TAIL (WaIdleLiteRestore).
- */
-#define WA_TAIL_DWORDS 2
-
 static int gen8_emit_request(struct drm_i915_gem_request *request)
 {
 	struct intel_ring *ring = request->ring;
 	int ret;
 
-	ret = intel_ring_begin(request, 6 + WA_TAIL_DWORDS);
+	ret = intel_ring_begin(request, 6);
 	if (ret)
 		return ret;
 
@@ -1622,7 +1664,7 @@ static int gen8_emit_request_render(struct drm_i915_gem_request *request)
 	struct intel_ring *ring = request->ring;
 	int ret;
 
-	ret = intel_ring_begin(request, 8 + WA_TAIL_DWORDS);
+	ret = intel_ring_begin(request, 6+1+1);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 86612d5..9c4175c 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -509,13 +509,20 @@ static inline u32 intel_engine_get_seqno(struct intel_engine_cs *engine)
 int init_workarounds_ring(struct intel_engine_cs *engine);
 
 /*
- * Arbitrary size for largest possible 'add request' sequence. The code paths
- * are complex and variable. Empirical measurement shows that the worst case
- * is BDW at 192 bytes (6 + 6 + 36 dwords), then ILK at 136 bytes. However,
- * we need to allocate double the largest single packet within that emission
- * to account for tail wraparound (so 6 + 6 + 72 dwords for BDW).
+ * Size for the longest possible 'add_request' sequence, for the given request.
+ *
+ * In theory, this could depend on the platform, or the engine, or many other
+ * things; hence the macro takes the request as an argument on which to base
+ * this calculation. In practice, the code paths are complex and variable,
+ * and it's simpler just to give an upper bound for all possibilities,
+ * ignoring the details of the actual request.
+ *
+ * Empirical measurement shows that the worst case is BDW at 192 bytes
+ * (6 + 6 + 36 dwords), then ILK at 136 bytes. However, we need to allocate
+ * double the largest single packet within that emission to account for tail
+ * wraparound (so 6 + 6 + 72 dwords for BDW).
  */
-#define MIN_SPACE_FOR_ADD_REQUEST 336
+#define MAX_ADD_REQUEST_DWORDS(request)	(6+6+72)
 
 static inline u32 intel_hws_seqno_address(struct intel_engine_cs *engine)
 {
-- 
1.9.1
