[Intel-gfx] [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
Chris Wilson
chris at chris-wilson.co.uk
Mon Apr 24 11:07:47 UTC 2017
On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
>
> On 19/04/2017 10:41, Chris Wilson wrote:
> >All the requests at the same priority are executed in FIFO order. They
> >do not need to be stored in the rbtree themselves, as they are a simple
> >list within a level. If we move the requests at one priority into a list,
> >we can then reduce the rbtree to the set of priorities. This should keep
> >the height of the rbtree small, as the number of active priorities cannot
> >exceed the number of active requests and should typically be only a few.
> >
> >Currently, we have ~2k possible different priority levels, and that may
> >increase to allow even finer-grained selection. Allocating them all in
> >advance seems a waste (and may be impossible), so we opt for allocating
> >a level upon first use, and freeing it once its requests are depleted. To
> >avoid the possibility of an allocation failure causing us to lose a
> >request, we preallocate the default priority (0) and bump any request to
> >that priority if we fail to allocate the appropriate plist for it. Having
> >a request (that is ready to run, so not leading to corruption) execute
> >out-of-order is better than leaking the request (and its dependency
> >tree) entirely.
> >
> >There should be a benefit to reducing execlists_dequeue() to principally
> >using a simple list (and reducing the frequency of both rbtree iteration
> >and balancing on erase) but for typical workloads, request coalescing
> >should be small enough that we don't notice any change. The main gain is
> >from improving the PI (priority inheritance) calls into schedule, and the
> >explicit list within a level should make request unwinding simpler (we
> >just need to insert at the head of the list rather than the tail, and not
> >have to make the rbtree search more complicated).
>
> Sounds attractive! What workloads show the benefit and how much?
The default workload will show the best results, since more or less
everything is at priority 0, so the rbtree search reduces to a single
lookup and a list_add. It's hard to measure the impact of the rbtree
though: on the dequeue side, the mmio access dominates; on the schedule
side, if we have lots of requests, the dfs dominates.

I have an idea on how we might stress the rbtree in submit_request - but
it still requires long queues that are atypical of most workloads. Still
TBD.
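For reference, the per-level node is tiny; modulo final naming it is
roughly just:

struct execlist_priolist {
	struct rb_node node;		/* keyed by priority in engine->execlist_queue */
	struct list_head requests;	/* FIFO of i915_priotree at this level */
	int priority;
};

so the common all-default case is one rbtree lookup followed by a
list_add_tail onto plist->requests.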
> >-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> >+static bool
> >+insert_request(struct intel_engine_cs *engine,
> >+	       struct i915_priotree *pt,
> >+	       int prio)
> > {
> >+	struct execlist_priolist *plist;
> > 	struct rb_node **p, *rb;
> > 	bool first = true;
> >
> >+find_plist:
> > 	/* most positive priority is scheduled first, equal priorities fifo */
> > 	rb = NULL;
> >-	p = &root->rb_node;
> >+	p = &engine->execlist_queue.rb_node;
> > 	while (*p) {
> >-		struct i915_priotree *pos;
> >-
> > 		rb = *p;
> >-		pos = rb_entry(rb, typeof(*pos), node);
> >-		if (pt->priority > pos->priority) {
> >+		plist = rb_entry(rb, typeof(*plist), node);
> >+		if (prio > plist->priority) {
> > 			p = &rb->rb_left;
> >-		} else {
> >+		} else if (prio < plist->priority) {
> > 			p = &rb->rb_right;
> > 			first = false;
> >+		} else {
> >+			list_add_tail(&pt->link, &plist->requests);
> >+			return false;
> > 		}
> > 	}
> >-	rb_link_node(&pt->node, rb, p);
> >-	rb_insert_color(&pt->node, root);
> >+
> >+	if (!prio) {
> >+		plist = &engine->default_priolist;
>
> Should be "prio == I915_PRIO_DEFAULT" (give or take).
>
> But I am not completely happy with special casing the default
> priority for two reasons.
>
> Firstly, userspace can opt to lower its priority and completely
> defeat this path.
>
> Secondly, we already have flip priority which perhaps should have
> its own fast path / avoid allocation as well.
>
> Those two combined make me unsure whether the optimisation is worth
> it. What would be the pros and cons of three steps:
>
> 1. No optimisation.
> 2. prio == default optimisation like above.
> 3. Better system with caching of frequently used levels.
>
> Last is definitely complicated, second is not, but is the second
> much better than the first?
It was not intended as an optimisation; it is for handling the ENOMEM
here. We cannot abort the request at such a late stage, so we need
somewhere to hold it. That dictated having a preallocated slot. I also
didn't like having to preallocate all possible levels, as that seems a
waste, especially as I like to invent new levels and suspect that we may
end up using the full u32 range.

Using the preallocated slot for the default priority was then just a
matter of taking advantage of the preallocation.
> Perhaps a simplification of 3) where we would defer the freeing of
> unused priority levels until the busy to idle transition? That would
> also drop the existence and need for special handling of
> engine->default_prio.
>
> >+	} else {
> >+		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> >+		/* Convert an allocation failure to a priority bump */
>
> Where is the priority bump? It looks like it can be the opposite for
> high prio requests below.
Correct. 'Bump' was the best verb I could think of - on allocation
failure the request drops to the preallocated priority-0 level, so for a
high-priority request the bump is indeed downwards.
> I don't think it matters what happens with priorities hugely when
> small allocations start to go bad but would like to understand the
> comment.
>
> And perhaps this would be worthy of a dedicated slab cache?
Even with a slab cache, we cannot prevent allocation failure. I don't
think priority-level allocations will be frequent enough to really
justify a dedicated cache, and the struct should be a good match for the
common kmalloc-64 slab.
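Back-of-envelope on 64-bit: an rb_node is three pointer-sized words (24
bytes), a list_head two (16), plus the int priority and padding comes to
~48 bytes, comfortably within kmalloc-64. If we want to document that
assumption, something along the lines of

	BUILD_BUG_ON(sizeof(*plist) > 64);

next to the kmalloc() would do.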
-Chris
--
Chris Wilson, Intel Open Source Technology Centre