[Intel-gfx] [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list
Chris Wilson
chris at chris-wilson.co.uk
Mon Apr 24 11:07:47 UTC 2017
On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
>
> On 19/04/2017 10:41, Chris Wilson wrote:
> >All the requests at the same priority are executed in FIFO order. They
> >do not need to be stored in the rbtree themselves, as they are a simple
> >list within a level. If we move the requests at one priority into a list,
> >we can then reduce the rbtree to the set of priorities. This should keep
> >the height of the rbtree small, as the number of active priorities cannot
> >exceed the number of active requests and should typically be only a few.
> >
> >Currently, we have ~2k possible different priority levels, and that may
> >increase to allow even finer-grained selection. Allocating them all in
> >advance seems a waste (and may be impossible), so we opt for allocating
> >a level upon first use, and freeing it once its requests are depleted. To
> >avoid the possibility of an allocation failure causing us to lose a
> >request, we preallocate the default priority (0) and bump any request to
> >that priority if we fail to allocate the appropriate plist for it. Having
> >a request (that is ready to run, so not leading to corruption) execute
> >out-of-order is better than leaking the request (and its dependency
> >tree) entirely.
> >
> >There should be a benefit to reducing execlists_dequeue() to principally
> >using a simple list (and reducing the frequency of both rbtree iteration
> >and balancing on erase) but for typical workloads, request coalescing
> >should be small enough that we don't notice any change. The main gain is
> >from improving the PI (priority inheritance) calls into schedule, and the
> >explicit list within a level should make request unwinding simpler (we
> >just need to insert at the head of the list rather than the tail, and not
> >have to make the rbtree search more complicated).
>
> Sounds attractive! What workloads show the benefit and how much?
The default workload will show the best results, since more or less
everything is at priority 0, so the rbtree search reduces to a single
lookup and a list_add. It's hard to measure the impact of the rbtree
though: on the dequeue side, the mmio access dominates; on the schedule
side, if we have lots of requests, the dfs dominates.

I have an idea on how we might stress the rbtree in submit_request - but
it still requires long queues that are atypical of most workloads. Still
TBD.
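For reference, the per-level node is tiny; modulo final naming it is
roughly just:

struct execlist_priolist {
	struct rb_node node;		/* keyed by priority in engine->execlist_queue */
	struct list_head requests;	/* FIFO of i915_priotree at this level */
	int priority;
};

so the common all-default case is one rbtree lookup followed by a
list_add_tail onto plist->requests.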
> >-static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
> >+static bool
> >+insert_request(struct intel_engine_cs *engine,
> >+	       struct i915_priotree *pt,
> >+	       int prio)
> > {
> >+	struct execlist_priolist *plist;
> > 	struct rb_node **p, *rb;
> > 	bool first = true;
> >
> >+find_plist:
> > 	/* most positive priority is scheduled first, equal priorities fifo */
> > 	rb = NULL;
> >-	p = &root->rb_node;
> >+	p = &engine->execlist_queue.rb_node;
> > 	while (*p) {
> >-		struct i915_priotree *pos;
> >-
> > 		rb = *p;
> >-		pos = rb_entry(rb, typeof(*pos), node);
> >-		if (pt->priority > pos->priority) {
> >+		plist = rb_entry(rb, typeof(*plist), node);
> >+		if (prio > plist->priority) {
> > 			p = &rb->rb_left;
> >-		} else {
> >+		} else if (prio < plist->priority) {
> > 			p = &rb->rb_right;
> > 			first = false;
> >+		} else {
> >+			list_add_tail(&pt->link, &plist->requests);
> >+			return false;
> > 		}
> > 	}
> >-	rb_link_node(&pt->node, rb, p);
> >-	rb_insert_color(&pt->node, root);
> >+
> >+	if (!prio) {
> >+		plist = &engine->default_priolist;
>
> Should be "prio == I915_PRIO_DEFAULT" (give or take).
>
> But I am not completely happy with special casing the default
> priority for two reasons.
>
> Firstly, userspace can opt to lower its priority and completely
> defeat this path.
>
> Secondly, we already have flip priority which perhaps should have
> its own fast path / avoid allocation as well.
>
> Those two combined make me unsure whether the optimisation is worth
> it. What would be the pros and cons of three steps:
>
> 1. No optimisation.
> 2. prio == default optimisation like above.
> 3. Better system with caching of frequently used levels.
>
> Last is definitely complicated, second is not, but is the second
> much better than the first?
It was not intended as an optimisation; it is for handling the ENOMEM
here. We cannot abort the request at such a late stage, so we need
somewhere to hold it. That dictated having a preallocated slot. I also
didn't like having to preallocate all possible levels, as that seems a
waste, especially as I like to invent new levels and suspect that we may
end up using the full u32 range.

Using the preallocated slot for the default priority was then just a
matter of taking advantage of the preallocation.
> Perhaps a simplification of 3) where we would defer the freeing of
> unused priority levels until the busy to idle transition? That would
> also drop the existence and need for special handling of
> engine->default_prio.
>
> >+	} else {
> >+		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
> >+		/* Convert an allocation failure to a priority bump */
>
> Where is the priority bump? It looks like it can be the opposite for
> high prio requests below.
Correct. 'Bump' was the best verb I could think of - on allocation
failure the request drops to the preallocated priority-0 level, so for a
high-priority request the bump is indeed downwards.
> I don't think it matters what happens with priorities hugely when
> small allocations start to go bad but would like to understand the
> comment.
>
> And perhaps this would be worthy of a dedicated slab cache?
Even with a slab cache, we cannot prevent allocation failure. I don't
think priority-level allocations will be frequent enough to really
justify a dedicated cache, and the struct should be a good match for the
common kmalloc-64 slab.
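Back-of-envelope on 64-bit: an rb_node is three pointer-sized words (24
bytes), a list_head two (16), plus the int priority and padding comes to
~48 bytes, comfortably within kmalloc-64. If we want to document that
assumption, something along the lines of

	BUILD_BUG_ON(sizeof(*plist) > 64);

next to the kmalloc() would do.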
-Chris
--
Chris Wilson, Intel Open Source Technology Centre