[Intel-gfx] [PATCH 2/3] drm/i915/scheduler: emulate a scheduler for guc

Wed Jan 11 17:28:30 UTC 2017

On Wed, Jan 11, 2017 at 04:55:46PM +0000, Tvrtko Ursulin wrote:
> 
> On 11/01/2017 13:13, Chris Wilson wrote:
> >This emulates execlists on top of the GuC in order to defer submission of
>  > requests to the hardware. This deferral allows time for high priority
> >requests to gazump their way to the head of the queue, however it nerfs
> >the GuC by converting it back into a simple execlist (where the CPU has
> >to wake up after every request to feed new commands into the GuC).
> >
> >v2: Drop hack status - though iirc there is still a lockdep inversion
> >between fence and engine->timeline->lock (which is impossible as the
> >nesting only occurs on different fences - hopefully just requires some
> >judicious lockdep annotation)
> 
> Hm hm... so fence->lock under timeline->lock when we enable
> signalling, while we already have the opposite in the
> submit_notify->submit_request yes? That would mean a per fence lock
> class, which can't work, or a nested annotation on fence->lock? So
> outside i915?

That was my approach as well, and by heading in that direction, we can
see that it is the same nested signal enabling issue we met when
handling the deferral of the breadcrumb signaler in
i915_gem_request_submit().

> >Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
> >---
> > drivers/gpu/drm/i915/i915_guc_submission.c | 79 +++++++++++++++++++++++++++---
> > drivers/gpu/drm/i915/i915_irq.c            |  4 +-
> > drivers/gpu/drm/i915/intel_lrc.c           |  5 +-
> > 3 files changed, 76 insertions(+), 12 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> >index 913d87358972..bdc9e2bc5eb9 100644
> >--- a/drivers/gpu/drm/i915/i915_guc_submission.c
> >+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> >@@ -350,7 +350,7 @@ int i915_guc_wq_reserve(struct drm_i915_gem_request *request)
> > 	u32 freespace;
> > 	int ret;
> >
> >-	spin_lock(&client->wq_lock);
> >+	spin_lock_irq(&client->wq_lock);
> > 	freespace = CIRC_SPACE(client->wq_tail, desc->head, client->wq_size);
> > 	freespace -= client->wq_rsvd;
> > 	if (likely(freespace >= wqi_size)) {
> >@@ -360,7 +360,7 @@ int i915_guc_wq_reserve(struct drm_i915_gem_request *request)
> > 		client->no_wq_space++;
> > 		ret = -EAGAIN;
> > 	}
> >-	spin_unlock(&client->wq_lock);
> >+	spin_unlock_irq(&client->wq_lock);
> >
> > 	return ret;
> > }
> >@@ -372,9 +372,9 @@ void i915_guc_wq_unreserve(struct drm_i915_gem_request *request)
> >
> > 	GEM_BUG_ON(READ_ONCE(client->wq_rsvd) < wqi_size);
> >
> >-	spin_lock(&client->wq_lock);
> >+	spin_lock_irq(&client->wq_lock);
> > 	client->wq_rsvd -= wqi_size;
> >-	spin_unlock(&client->wq_lock);
> >+	spin_unlock_irq(&client->wq_lock);
> > }
> >
> > /* Construct a Work Item and append it to the GuC's Work Queue */
> >@@ -534,10 +534,74 @@ static void __i915_guc_submit(struct drm_i915_gem_request *rq)
> >
> > static void i915_guc_submit(struct drm_i915_gem_request *rq)
> > {
> >-	i915_gem_request_submit(rq);
> >+	__i915_gem_request_submit(rq);
> > 	__i915_guc_submit(rq);
> > }
> >
> >+static bool i915_guc_dequeue(struct intel_engine_cs *engine)
> >+{
> >+	struct execlist_port *port = engine->execlist_port;
> >+	struct drm_i915_gem_request *last = port[0].request;
> >+	unsigned long flags;
> >+	struct rb_node *rb;
> >+	bool submit = false;
> >+
> >+	spin_lock_irqsave(&engine->timeline->lock, flags);
> >+	rb = engine->execlist_first;
> >+	while (rb) {
> >+		struct drm_i915_gem_request *cursor =
> >+			rb_entry(rb, typeof(*cursor), priotree.node);
> >+
> >+		if (last && cursor->ctx != last->ctx) {
> >+			if (port != engine->execlist_port)
> >+				break;
> >+
> >+			i915_gem_request_assign(&port->request, last);
> >+			dma_fence_enable_sw_signaling(&last->fence);
> >+			port++;
> >+		}
> >+
> >+		rb = rb_next(rb);
> >+		rb_erase(&cursor->priotree.node, &engine->execlist_queue);
> >+		RB_CLEAR_NODE(&cursor->priotree.node);
> >+		cursor->priotree.priority = INT_MAX;
> >+
> >+		i915_guc_submit(cursor);
> >+		last = cursor;
> >+		submit = true;
> >+	}
> >+	if (submit) {
> >+		i915_gem_request_assign(&port->request, last);
> >+		dma_fence_enable_sw_signaling(&last->fence);
> >+		engine->execlist_first = rb;
> >+	}
> >+	spin_unlock_irqrestore(&engine->timeline->lock, flags);
> >+
> >+	return submit;
> >+}
> 
> It is again tempting me to suggest a single instance of the dequeue
> loop. Maybe something like:
> 
> i915_request_dequeue(engine, assign_func, submit_func)
> {
> 	...
> 	
> 	while (rb) {
> 		...
> 		if () {

Don't forget

if (!cant_merge_func()) { /* which is where I start not liking this helper */
	if (must_break_func())
		break;

> 		...
> 			assign_func();
> 			port++;
> 		}
> 		
> 		...
> 
> 		submit_func();
> 		last = cursor;
> 		submit = true;
> 	}
> 	if (submit) {
> 		assign_func();
> 		engine->execlists_first = rb;
> 	}
> 
> 	...
> }

[snip]

> execlists_dequeue()
> {
> 	...
> 
> 	submit = i915_request_deqeue(engine,
> 				     execlists_dequeue_assign,
> 				     execlists_deqeue_submit);
> 	if (submit)
> 		execlists_submit_ports(engine);
> }

I'm still hesistating because I don't think this the direction guc
submission will take (I can hope for a true hw scheduler!) and I worry
about whether it will make the next step less clear. My crystal ball is
cloudy. But yes there is too much duplication here to ignore completely.

> >+static void i915_guc_irq_handler(unsigned long data)
> >+{
> >+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> >+	struct execlist_port *port = engine->execlist_port;
> >+	struct drm_i915_gem_request *rq;
> >+	bool submit;
> >+
> >+	do {
> >+		rq = port[0].request;
> >+		while (rq && i915_gem_request_completed(rq)) {
> >+			i915_gem_request_put(rq);
> >+			rq = port[1].request;
> >+			port[0].request = rq;
> >+			port[1].request = NULL;
> >+		}
> >+
> >+		submit = false;
> >+		if (!port[1].request)
> >+			submit = i915_guc_dequeue(engine);
> >+	} while (submit);
> 
> Outer loop looks to be not needed since the i915_guc_dequeue will
> fill both ports if it can. Or it is for the unlikely case a new
> request manages to get in after one and only has been submitted?

Yup, just paranoia to not return too soon if we can help it. It is not
needed since the irq will fire shortly (worst case) and we will be
handling the interrupt anyway. The loop is redundant, and yes it is
probably clearer without, but worst case is that we may leave the GPU
idle for a short period.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre