[Intel-gfx] [PATCH v3 11/14] HACK drm/i915/scheduler: emulate a scheduler for guc

Chris Wilson chris at chris-wilson.co.uk
Thu Dec 1 11:18:30 UTC 2016


On Thu, Dec 01, 2016 at 10:45:51AM +0000, Tvrtko Ursulin wrote:
> 
> On 14/11/2016 08:57, Chris Wilson wrote:
> >This emulates execlists on top of the GuC in order to defer submission of
> >requests to the hardware. This deferral allows time for high priority
> >requests to gazump their way to the head of the queue, however it nerfs
> >the GuC by converting it back into a simple execlist (where the CPU has
> >to wake up after every request to feed new commands into the GuC).
> 
> As it is starting to sink in that we'll have to add this hack sooner
> or later, review comments below.
> 
> Also, would you be OK to rebase this or would prefer to delegate it?

It's been very easy to keep it current.

> >@@ -665,6 +665,70 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
> > 	spin_unlock(&client->wq_lock);
> > }
> 
> Confused me at first here until I noticed engine->submit_request
> will be the execlist_submit_request later. Perhaps it would be good
> to rename a lot of things now, like engine->request_queue,
> intel_engine_submit_request and maybe more?

Looking at the lifecycle of the request and getting clear names for the
phases and their vfuncs would be sensible. It may also be worthwhile to
decide that some are not engine tasks and belong to a new entity
(dev_priv->gt.X ?)
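
Purely as a naming sketch (none of these names exist; they only
illustrate splitting "queue a request for the scheduler" from "feed the
backend", and giving the non-engine pieces a dev_priv->gt home):

	/* hypothetical names, for illustration only */
	struct intel_gt {
		struct tasklet_struct dequeue;	/* not really an engine task */
	};

	/* in intel_engine_cs: */
	void (*queue_request)(struct drm_i915_gem_request *rq); /* was submit_request */
	void (*submit_ports)(struct intel_engine_cs *engine);	 /* feed ELSP / GuC wq */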
 
> >+static bool i915_guc_dequeue(struct intel_engine_cs *engine)
> >+{
> >+	struct execlist_port *port = engine->execlist_port;
> >+	struct drm_i915_gem_request *last = port[0].request;
> >+	unsigned long flags;
> >+	struct rb_node *rb;
> >+	bool submit = false;
> >+
> >+	spin_lock_irqsave(&engine->timeline->lock, flags);
> >+	rb = engine->execlist_first;
> >+	while (rb) {
> >+		struct drm_i915_gem_request *cursor =
> >+			rb_entry(rb, typeof(*cursor), priotree.node);
> >+
> >+		if (last && cursor->ctx != last->ctx) {
> 
> Not sure if GVT comes into the picture here, but it does not sound
> like it would do any harm to use can_merge_ctx here?

I wasn't sure what path GVT would take either. So just went with the
simple version that looked as similar to the current guc submission as
possible. Also offloading the scheduling to the guc via semaphores will
likely make this whole chain look completely different.
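
For reference, a minimal sketch of what that check could look like if
it mirrored the execlists merge test (the helper below is speculative,
modelled on intel_lrc.c rather than lifted from it; whether the GVT
single-submission contexts need excluding is exactly the open question):

	/* speculative sketch, not the driver's definitive rule */
	static bool guc_can_merge_ctx(const struct i915_gem_context *prev,
				      const struct i915_gem_context *next)
	{
		/* only coalesce requests from the same context... */
		if (prev != next)
			return false;

		/* ...and never merge a context that demands a port to
		 * itself (the GVT case) -- hypothetical helper name */
		if (ctx_requires_single_submission(prev))
			return false;

		return true;
	}

with the test in the loop becoming
if (last && !guc_can_merge_ctx(cursor->ctx, last->ctx)).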

> >+			if (port != engine->execlist_port)
> >+				break;
> 
> It may be overkill for the first version, but I was thinking that
> we don't have to limit it to two at a time. It would depend on
> measuring, of course. But perhaps it would make sense to do the
> generalisation of the number of supported ports straight away.

Definitely. I was just looking at a minimal conversion, hence reusing
the existing tracking and limits.
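
As a rough sketch of lifting the two-port limit, the "have we run out
of ports?" test could be bounded by a per-engine count instead of
comparing against the first port (execlist_port_count is a made-up
field here):

	/* sketch only: execlist_port_count does not exist today */
	static bool port_is_last(const struct intel_engine_cs *engine,
				 const struct execlist_port *port)
	{
		return port ==
		       &engine->execlist_port[engine->execlist_port_count - 1];
	}

so the dequeue loop would break on port_is_last(engine, port) rather
than on port != engine->execlist_port.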

> We could theoretically share most of the execlist_dequeue and just
> do a couple things differently depending on the mode.

Yes, there were just a couple of intrusive warts that I felt justified
keeping the routines separate. But mostly it was the merge_ctx decision
that looked to be backend-specific.
 
> Looks like one could be a new engine->submit_ports vfunc. And there
> is also the lite restore WA and sw signalling to design in nicely,
> but it may be worth sharing the code. It would be renamed to
> something like scheduler_dequeue.
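
Something like this is roughly the shape I'd imagine for that split
(all of these names are speculative, just to show the vfunc boundary;
neither submit_ports nor scheduler_dequeue exists in the driver):

	/* speculative sketch of a shared dequeue with a backend hook */
	static bool scheduler_fill_ports(struct intel_engine_cs *engine)
	{
		/* common body: walk engine->execlist_first, merge
		 * contexts, assign requests to engine->execlist_port[];
		 * return true if anything new was placed in a port */
		return false;
	}

	static void scheduler_dequeue(struct intel_engine_cs *engine)
	{
		if (scheduler_fill_ports(engine))
			/* hypothetical vfunc: ELSP write for execlists,
			 * doorbell ring for the GuC */
			engine->submit_ports(engine);
	}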

> >diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> >index cb8a75f6ca16..18dce4c66d56 100644
> >--- a/drivers/gpu/drm/i915/i915_irq.c
> >+++ b/drivers/gpu/drm/i915/i915_irq.c
> >@@ -1341,8 +1341,10 @@ static void snb_gt_irq_handler(struct drm_i915_private *dev_priv,
> > static __always_inline void
> > gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
> > {
> >-	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
> >+	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) {
> >+		tasklet_schedule(&engine->irq_tasklet);
> 
> This would be better made conditional on GuC submission, just to
> avoid calling tasklet_schedule twice (occasionally) in execlist mode.

It's not a huge cost if we do schedule the tasklet twice (in the same
interrupt context at least). Otherwise, yes we are incurring more mmio
reads to determine that the CSB didn't advance. On the other hand, I
really don't want to have if (i915.enable_guc_submission) here. Maybe
if (engine->user_irq_tasklet) tasklet_schedule(engine->user_irq_tasklet)?
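
In context that would look something like the below (user_irq_tasklet
is a hypothetical field, only set when the GuC backend wants the extra
kick; the rest follows the shape of the existing handler):

	static __always_inline void
	gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir,
			    int test_shift)
	{
		if (iir & (GT_RENDER_USER_INTERRUPT << test_shift)) {
			/* only poke the submission tasklet when the
			 * backend asked for it, not unconditionally */
			if (engine->user_irq_tasklet)
				tasklet_schedule(engine->user_irq_tasklet);
			notify_ring(engine);
		}
		if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
			tasklet_schedule(&engine->irq_tasklet);
	}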

Speaking of CSB, the spec keeps hinting that "CSB occupies a whole
cacheline, making the read cheap". I'm not aware of any way to read a
whole cacheline from UC in a single transfer. Ideas? I thought movntdqa
was specifically for WC.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

