[Intel-gfx] [PATCH 15/27] drm/i915: Split execlist priority queue into rbtree + linked list

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Mon Apr 24 12:44:53 UTC 2017


On 24/04/2017 12:07, Chris Wilson wrote:
> On Mon, Apr 24, 2017 at 11:28:32AM +0100, Tvrtko Ursulin wrote:
>>
>> On 19/04/2017 10:41, Chris Wilson wrote:
>>> All the requests at the same priority are executed in FIFO order. They
>>> do not need to be stored in the rbtree themselves, as they are a simple
>>> list within a level. If we move the requests at one priority into a list,
>>> we can then reduce the rbtree to the set of priorities. This should keep
>>> the height of the rbtree small, as the number of active priorities can not
>>> exceed the number of active requests and should be typically only a few.
>>>
>>> Currently, we have ~2k possible different priority levels, that may
>>> increase to allow even more fine grained selection. Allocating those in
>>> advance seems a waste (and may be impossible), so we opt for allocating
>>> upon first use, and freeing after its requests are depleted. To avoid
>>> the possibility of an allocation failure causing us to lose a request,
>>> we preallocate the default priority (0) and bump any request to that
>>> priority if we fail to allocate it the appropriate plist. Having a
>>> request (that is ready to run, so not leading to corruption) execute
>>> out-of-order is better than leaking the request (and its dependency
>>> tree) entirely.
>>>
>>> There should be a benefit to reducing execlists_dequeue() to principally
>>> using a simple list (and reducing the frequency of both rbtree iteration
>>> and balancing on erase) but for typical workloads, request coalescing
>>> should be small enough that we don't notice any change. The main gain is
>>> from improving PI calls to schedule, and the explicit list within a
>>> level should make request unwinding simpler (we just need to insert at
>>> the head of the list rather than the tail and not have to make the
>>> rbtree search more complicated).
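(As a sketch of the two-level structure the commit message describes — requests in FIFO order within a priority level, levels ordered by priority — here is a userspace approximation. A sorted singly-linked list stands in for the kernel rbtree, and all names are illustrative, not the actual i915 ones:)

```c
#include <assert.h>
#include <stdlib.h>

struct request {
	int seqno;
	struct request *next;
};

struct priolist {
	int priority;
	struct request *head, *tail;	/* FIFO within the level */
	struct priolist *next;		/* levels, highest priority first */
};

/* Insert at the tail of the matching level, allocating the level on
 * first use (the patch does this with an rbtree walk). */
static struct priolist *insert_request(struct priolist *levels,
				       struct request *rq, int prio)
{
	struct priolist **p = &levels, *plist;

	while (*p && (*p)->priority > prio)
		p = &(*p)->next;

	if (*p && (*p)->priority == prio) {
		plist = *p;
	} else {
		plist = calloc(1, sizeof(*plist));
		plist->priority = prio;
		plist->next = *p;
		*p = plist;
	}

	rq->next = NULL;
	if (plist->tail)
		plist->tail->next = rq;
	else
		plist->head = rq;
	plist->tail = rq;
	return levels;
}

/* Pop the oldest request from the highest-priority level, freeing the
 * level once its requests are depleted. */
static struct request *dequeue(struct priolist **levels)
{
	struct priolist *plist = *levels;
	struct request *rq;

	if (!plist)
		return NULL;
	rq = plist->head;
	plist->head = rq->next;
	if (!plist->head) {
		plist->tail = NULL;
		*levels = plist->next;
		free(plist);
	}
	return rq;
}
```

The point of the split is visible here: dequeue at a given priority is a list pop, and the level search only happens once per level rather than once per request.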
>>
>> Sounds attractive! What workloads show the benefit and how much?
>
> The default will show the best, since everything is priority 0 more or
> less and so we reduce the rbtree search to a single lookup and list_add.
> It's hard to measure the impact of the rbtree though. On the dequeue
> side, the mmio access dominates. On the schedule side, if we have lots
> of requests, the dfs dominates.
>
> I have an idea on how we might stress the rbtree in submit_request - but
> still it requires long queues untypical of most workloads. Still tbd.
>
>>> -static bool insert_request(struct i915_priotree *pt, struct rb_root *root)
>>> +static bool
>>> +insert_request(struct intel_engine_cs *engine,
>>> +	       struct i915_priotree *pt,
>>> +	       int prio)
>>> {
>>> +	struct execlist_priolist *plist;
>>> 	struct rb_node **p, *rb;
>>> 	bool first = true;
>>>
>>> +find_plist:
>>> 	/* most positive priority is scheduled first, equal priorities fifo */
>>> 	rb = NULL;
>>> -	p = &root->rb_node;
>>> +	p = &engine->execlist_queue.rb_node;
>>> 	while (*p) {
>>> -		struct i915_priotree *pos;
>>> -
>>> 		rb = *p;
>>> -		pos = rb_entry(rb, typeof(*pos), node);
>>> -		if (pt->priority > pos->priority) {
>>> +		plist = rb_entry(rb, typeof(*plist), node);
>>> +		if (prio > plist->priority) {
>>> 			p = &rb->rb_left;
>>> -		} else {
>>> +		} else if (prio < plist->priority) {
>>> 			p = &rb->rb_right;
>>> 			first = false;
>>> +		} else {
>>> +			list_add_tail(&pt->link, &plist->requests);
>>> +			return false;
>>> 		}
>>> 	}
>>> -	rb_link_node(&pt->node, rb, p);
>>> -	rb_insert_color(&pt->node, root);
>>> +
>>> +	if (!prio) {
>>> +		plist = &engine->default_priolist;
>>
>> Should be "prio == I915_PRIO_DEFAULT" (give or take).
>>
>> But I am not completely happy with special casing the default
>> priority for two reasons.
>>
>> Firstly, userspace can opt to lower its priority and completely
>> defeat this path.
>>
>> Secondly, we already have flip priority which perhaps should have
>> its own fast path / avoid allocation as well.
>>
>> Those two combined make me unsure whether the optimisation is worth
>> it. What would be the pros and cons of three steps:
>>
>> 1. No optimisation.
>> 2. prio == default optimisation like above.
>> 3. Better system with caching of frequently used levels.
>>
>> Last is definitely complicated, second is not, but is the second
>> much better than the first?
>
> It was not intended as an optimisation. It is for handling the
> ENOMEM here. We cannot abort the request at such a late stage, so we
> need somewhere to hold it. That dictated having a preallocted slot. I
> also didn't like having to preallocate all possible levels as that seems
> a waste, especially as I like to invent new levels and suspect that we
> may end up using a full u32 range.
>
> Using it for the default priority was then to take advantage of the
> preallocation.
>
>> Perhaps a simplification of 3) where we would defer the freeing of
>> unused priority levels until the busy to idle transition? That would
>> also drop the existence and need for special handling of
>> engine->default_prio.
>>
>>> +	} else {
>>> +		plist = kmalloc(sizeof(*plist), GFP_ATOMIC);
>>> +		/* Convert an allocation failure to a priority bump */
>>
>> Where is the priority bump? It looks like it can be the opposite for
>> high prio requests below.
>
> Correct. Bump was the best verb I thought of.
>
>> I don't think it matters what happens with priorities hugely when
>> small allocations start to go bad but would like to understand the
>> comment.
>>
>> And perhaps this would be worthy of a dedicated slab cache?
>
> Even with a slab cache, we cannot prevent allocation failure. I don't
> think priority levels will be frequent enough to really justify one.
> Should be a good match for the common kmalloc-64 slab.

We could keep a pre-allocated entry with each engine which would 
transfer ownership with insert_request. It would have to be allocated at 
a point where we can fail, like request_alloc, but the downside would be 
starting to take the engine timeline lock in the request alloc path. 
Only to check and preallocate if needed, but still. And it would mean 
more traffic on the slab API in that path as well. Oh well, not very 
nice. I was just thinking whether we can avoid GFP_ATOMIC and the 
default priority fallback. It seems like your solution is a better 
compromise.
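(For reference, the fallback being discussed — a preallocated default level that absorbs requests when allocating a new level fails — reduces to something like the following userspace sketch. The struct layout, the `oom` flag simulating kmalloc failure, and the function name are all illustrative:)

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct priolist {
	int priority;
};

struct engine {
	struct priolist default_priolist;	/* preallocated, priority 0 */
	int oom;				/* simulate allocation failure */
};

/* Find or allocate the level for *prio. On allocation failure, bump
 * the request to the preallocated default level: it executes at
 * priority 0, possibly out of order, instead of being lost along with
 * its dependency tree. */
static struct priolist *get_priolist(struct engine *e, int *prio)
{
	struct priolist *plist;

	if (*prio == 0)
		return &e->default_priolist;

	plist = e->oom ? NULL : malloc(sizeof(*plist));
	if (!plist) {
		/* Convert an allocation failure to a priority bump */
		*prio = 0;
		return &e->default_priolist;
	}
	plist->priority = *prio;
	return plist;
}
```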

A couple more questions on the patch details, then.

Could you implement the list handling in a more obvious way? Instead of 
comparing link.next == link.prev, use the more obvious list_empty on 
plist->requests; why __list_del_entry and not just list_del; and the 
list_add_tail could be just list_add, since the list is empty at that 
point and _tail falsely suggests the tail position is important.
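(To spell out the last point: with a minimal userspace copy of the kernel's list_head, one can check that list_add and list_add_tail insert at the same position when the list is empty, so either spelling works and _add is the less misleading one:)

```c
#include <assert.h>

/* Minimal userspace list_head, matching the kernel's circular
 * doubly-linked list semantics. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void __list_add(struct list_head *n, struct list_head *prev,
		       struct list_head *next)
{
	next->prev = n;
	n->next = next;
	n->prev = prev;
	prev->next = n;
}

/* Insert right after the head. */
static void list_add(struct list_head *n, struct list_head *h)
{
	__list_add(n, h, h->next);
}

/* Insert right before the head, i.e. at the tail. */
static void list_add_tail(struct list_head *n, struct list_head *h)
{
	__list_add(n, h->prev, h);
}

/* Clearer than open-coding h->next == h->prev at call sites. */
static int list_empty(const struct list_head *h)
{
	return h->next == h;
}
```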

Regards,

Tvrtko
