[Intel-gfx] [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list

Chris Wilson chris at chris-wilson.co.uk
Mon Apr 24 13:06:08 UTC 2017


On Mon, Apr 24, 2017 at 01:44:53PM +0100, Tvrtko Ursulin wrote:
> 
> On 24/04/2017 12:07, Chris Wilson wrote:
> >On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
> >>
> >>On 19/04/2017 10:41, Chris Wilson wrote:
> >>>All the requests at the same priority are executed in FIFO order. They
> >>>do not need to be stored in the rbtree themselves, as they are a simple
> >>>list within a level. If we move the requests at one priority into a list,
> >>>we can then reduce the rbtree to the set of priorities. This should keep
> >>>the height of the rbtree small, as the number of active priorities cannot
> >>>exceed the number of active requests and should typically be only a few.
> >>>
> >>>Currently, we have ~2k possible different priority levels, and that may
> >>>increase to allow even more fine-grained selection. Allocating those in
> >>>advance seems a waste (and may be impossible), so we opt for allocating
> >>>a level upon first use, and freeing it once its requests are depleted.
> >>>To avoid
> >>>the possibility of an allocation failure causing us to lose a request,
> >>>we preallocate the default priority (0) and bump any request to that
> >>>priority if we fail to allocate the appropriate plist for it. Having a
> >>>request (that is ready to run, so not leading to corruption) execute
> >>>out-of-order is better than leaking the request (and its dependency
> >>>tree) entirely.
> >>>
> >>>There should be a benefit to reducing execlists_dequeue() to principally
> >>>using a simple list (and reducing the frequency of both rbtree iteration
> >>>and balancing on erase), but for typical workloads, request coalescing
> >>>should be small enough that we don't notice any change. The main gain
> >>>is from improving priority-inheritance (PI) calls to schedule, and the
> >>>explicit list within a level should make request unwinding simpler (we
> >>>just need to insert at the head of the list rather than the tail, and
> >>>do not have to make the rbtree search more complicated).
> >>
> >>Sounds attractive! What workloads show the benefit and how much?
> >
> >The default will show the best gain, since more or less everything is
> >priority 0, so we reduce the rbtree search to a single lookup and a
> >list_add.
> >It's hard to measure the impact of the rbtree though. On the dequeue
> >side, the mmio access dominates. On the schedule side, if we have lots
> >of requests, the dfs dominates.
> >
> >I have an idea of how we might stress the rbtree in submit_request, but
> >it requires long queues atypical of most workloads. Still tbd.
> >
> >>>-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> >>>+static bool
> >>>+insert_request(struct intel_engine_cs *engine,
> >>>+	       struct i915_priotree *pt,
> >>>+	       int prio)
> >>>{
> >>>+	struct execlist_priolist *plist;
> >>>	struct rb_node **p, *rb;
> >>>	bool first = true;
> >>>
> >>>+find_plist:
> >>>	/* most positive priority is scheduled first, equal priorities fifo */
> >>>	rb = NULL;
> >>>-	p = &root->rb_node;
> >>>+	p = &engine->execlist_queue.rb_node;
> >>>	while (*p) {
> >>>-		struct i915_priotree *pos;
> >>>-
> >>>		rb = *p;
> >>>-		pos = rb_entry(rb, typeof(*pos), node);
> >>>-		if (pt->priority > pos->priority) {
> >>>+		plist = rb_entry(rb, typeof(*plist), node);
> >>>+		if (prio > plist->priority) {
> >>>			p = &rb->rb_left;
> >>>-		} else {
> >>>+		} else if (prio < plist->priority) {
> >>>			p = &rb->rb_right;
> >>>			first = false;
> >>>+		} else {
> >>>+			list_add_tail(&pt->link, &plist->requests);
> >>>+			return false;
> >>>		}
> >>>	}
> >>>-	rb_link_node(&pt->node, rb, p);
> >>>-	rb_insert_color(&pt->node, root);
> >>>+
> >>>+	if (!prio) {
> >>>+		plist = &engine->default_priolist;
> >>
> >>Should be "prio == I915_PRIO_DEFAULT" (give or take).
> >>
> >>But I am not completely happy with special casing the default
> >>priority for two reasons.
> >>
> >>Firstly, userspace can opt to lower its priority and completely
> >>defeat this path.
> >>
> >>Secondly, we already have flip priority, which perhaps should have
> >>its own fast path / avoid allocation as well.
> >>
> >>Those two combined make me unsure whether the optimisation is worth
> >>it. What would be the pros and cons of three steps:
> >>
> >>1. No optimisation.
> >>2. prio == default optimisation like above.
> >>3. Better system with caching of frequently used levels.
> >>
> >>The last is definitely complicated, the second is not, but is the
> >>second much better than the first?
> >
> >It was not intended as an optimisation. It is for handling the
> >ENOMEM here. We cannot abort the request at such a late stage, so we
> >need somewhere to hold it. That dictated having a preallocated slot. I
> >also didn't like having to preallocate all possible levels as that seems
> >a waste, especially as I like to invent new levels and suspect that we
> >may end up using a full u32 range.
> >
> >Using it for the default priority was then to take advantage of the
> >preallocation.
> >
> >>Perhaps a simplification of 3) where we would defer the freeing of
> >>unused priority levels until the busy-to-idle transition? That would
> >>also drop the existence and need for special handling of
> >>engine->default_prio.
> >>
> >>>+	} else {
> >>>+		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> >>>+		/* Convert an allocation failure to a priority bump */
> >>
> >>Where is the priority bump? It looks like it can be the opposite for
> >>high prio requests below.
> >
> >Correct. "Bump" was the best verb I could think of.
> >
> >>I don't think it matters hugely what happens with priorities when
> >>small allocations start to go bad, but I would like to understand
> >>the comment.
> >>
> >>And perhaps this would be worthy of a dedicated slab cache?
> >
> >Even with a slab cache, we cannot prevent allocation failure. I don't
> >think priority levels will be frequent enough to really justify one.
> >Should be a good match for the common kmalloc-64 slab.
> 
> We could keep a pre-allocated entry with each engine, which would
> transfer ownership in insert_request. It would have to be allocated
> at a point where we can fail, like request_alloc, but the downside
> would be starting to take the engine timeline lock in the request
> alloc path - only to check and preallocate if needed, but still. And
> it would mean more traffic on the slab API in that path as well. Oh
> well, not very nice. I was just wondering if we could avoid
> GFP_ATOMIC and the default priority fallback. It seems like your
> solution is a better compromise.

No worries. I didn't particularly like the idea of reserving a slot with
each request either, or having a GFP_ATOMIC nestled so deep in request
submission. It is certainly possible for us to do the allocation at
request_alloc and carry it through to schedule - but still that only
covers the first call to schedule. It seems we would have to resort to
always passing in a slot, and freeing that slot if unused.
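
Very roughly, always passing a slot in would look something like the
below (just a sketch: lookup_priolist() and rq->spare_priolist are
invented here for illustration, they are not in the patch):

	/* At request_alloc, where we are still allowed to fail */
	spare = kmalloc(sizeof(*spare), GFP_KERNEL);
	if (!spare)
		return -ENOMEM;
	rq->spare_priolist = spare;

	/* Later, under the engine timeline lock; rb/p are from the
	 * tree walk, as in the insert_request() hunk above
	 */
	plist = lookup_priolist(engine, prio);
	if (!plist) {
		plist = rq->spare_priolist;
		rq->spare_priolist = NULL; /* ownership moves to the rbtree */
		plist->priority = prio;
		INIT_LIST_HEAD(&plist->requests);
		rb_link_node(&plist->node, rb, p);
		rb_insert_color(&plist->node, &engine->execlist_queue);
	}
	list_add_tail(&pt->link, &plist->requests);

	/* And at retirement, drop the slot if it went unused */
	kfree(rq->spare_priolist);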
 
> A couple more questions on the patch details then.
> 
> Could you implement the list handling in a more obvious way? Instead
> of link.next == link.prev, use the more obvious list_empty on
> plist->requests; why __list_del_entry and not just list_del? And you
> have a list_add_tail as well, which could be just list_add, since the
> list is empty at that point and _tail falsely suggests the tail is
> important.

After a few more passes, it is saner now (with just one debatable
list-handling tweak from ages ago):

static inline void __list_del_many(struct list_head *head,
                                   struct list_head *first)
{
        head->next = first;
        first->prev = head;
}
...
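/* For reference, the per-level node the loop below walks boils down
 * to the following (field types inferred from their use in the
 * insert_request() hunk above, so take this as a sketch):
 */
struct execlist_priolist {
	struct rb_node node;       /* position in engine->execlist_queue */
	struct list_head requests; /* FIFO of priotree.link at this level */
	int priority;
};
...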
	rb = engine->execlist_first;
	GEM_BUG_ON(rb_first(&engine->execlist_queue) != rb);
	while (rb) {
		struct execlist_priolist *plist =
			rb_entry(rb, typeof(*plist), node);
		struct drm_i915_gem_request *rq, *rn;

		list_for_each_entry_safe(rq, rn,
					 &plist->requests, priotree.link) {
			if (!merge(rq)) { /* blah */
				__list_del_many(&plist->requests,
						&rq->priotree.link);
				goto done;
			}

			INIT_LIST_HEAD(&rq->priotree.link);
			rq->priotree.priority = INT_MAX;

			... /* __i915_gem_request_submit(rq); */
		}

		rb = rb_next(rb);
		rb_erase(&plist->node, &engine->execlist_queue);
		INIT_LIST_HEAD(&plist->requests);
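		/* Only kmalloc'ed levels are freed; the preallocated
		 * default (priority 0) is embedded in the engine.
		 */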
		if (plist->priority)
			kfree(plist);
	}
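
The __list_del_many() above is what lets the loop detach the whole
span of requests it has already consumed from plist->requests with a
single pointer update, rather than paying for a list_del() on every
request.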

You may notice I have a dislike of the cache misses from lists. :|
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

