[Intel-gfx] [PATCH 07/27] drm/i915: Squash repeated awaits on the same fence

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Wed Apr 26 14:36:19 UTC 2017


On 26/04/2017 13:23, Chris Wilson wrote:
> On Wed, Apr 26, 2017 at 01:13:41PM +0100, Tvrtko Ursulin wrote:
>>
>> On 26/04/2017 12:18, Chris Wilson wrote:
>>> On Wed, Apr 26, 2017 at 11:54:08AM +0100, Tvrtko Ursulin wrote:
>>>>>>> +/* struct intel_timeline_sync is a layer of a radixtree that maps a u64 fence
>>>>>>> + * context id to the last u32 fence seqno waited upon from that context.
>>>>>>> + * Unlike lib/radixtree it uses a parent pointer that allows traversal back to
>>>>>>> + * the root. This allows us to access the whole tree via a single pointer
>>>>>>> + * to the most recently used layer. We expect fence contexts to be dense
>>>>>>> + * and most reuse to be on the same i915_gem_context but on neighbouring
>>>>>>> + * engines (i.e. on adjacent contexts) and reuse the same leaf, a very
>>>>>>> + * effective lookup cache. If the new lookup is not on the same leaf, we
>>>>>>> + * expect it to be on the neighbouring branch.
>>>>>>> + *
>>>>>>> + * A leaf holds an array of u32 seqno, and has height 0. The bitmap field
>>>>>>> + * allows us to store whether a particular seqno is valid (i.e. allows us
>>>>>>> + * to distinguish unset from 0).
>>>>>>> + *
>>>>>>> + * A branch holds an array of layer pointers, and has height > 0, and always
>>>>>>> + * has at least 2 layers (either branches or leaves) below it.
>>>>>>> + *
>>>>>>> + */
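
For reference, a minimal sketch of the layer layout that comment describes;
the field names and the fan-out of 16 are illustrative, not the actual
definitions from the patch:

#include <linux/types.h>

/* One layer of the compressed tree: either a branch or a leaf. */
struct intel_timeline_sync {
        u64 prefix;             /* upper bits of the context ids below here */
        unsigned int height;    /* 0 for a leaf, > 0 for a branch */
        unsigned int bitmap;    /* leaf: marks which seqno slots are valid */
        struct intel_timeline_sync *parent;     /* walk back towards the root */
        union {
                u32 seqno[16];  /* leaf: last seqno waited upon, per context */
                struct intel_timeline_sync *child[16];  /* branch: >= 2 sub-layers */
        };
};
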
>>>>>>
>>>>>> @_@ :)
>>>>>>
>>>>>> Ok, so a map of u64 to u32. We can't use IDR or radixtree directly
>>>>>> because of u64 keys. :( How about a hash table? It would be much
>>>>>> simpler to review. :) Seriously, if it would perform close enough it
>>>>>> would be a much much simpler implementation.
>>>>>
>>>>> You want a resizable hashtable. rht is appallingly slow, so you want a
>>>>> custom resizeable ht. They are not as simple as this codewise ;)
>>>>> (Plus a compressed radixtree is part of my plan for scalability
>>>>> improvements for struct reservation_object.)
>>>>
>>>> Why resizable? I was thinking of a normal one. If at any given time
>>>> we have an active set of contexts, or at least the lookups are, as
>>>> you say below, to neighbouring contexts, that would mean lookups go
>>>> to different hash buckets. And for the typical working set, would we
>>>> expect many collisions and so longer lists in each bucket? So maybe
>>>> NUM_ENGINES * some typical load constant number of buckets would not
>>>> be that bad?
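
To make that sizing concrete, the fixed table being suggested could be
declared roughly as below; the per-engine load factor is a guess, and
I915_NUM_ENGINES stands in for the driver's engine count:

#include <linux/hashtable.h>
#include <linux/log2.h>

/* Guess at a typical number of active contexts per engine. */
#define SYNC_HT_CONTEXTS_PER_ENGINE     32

/* e.g. 5 engines * 32 contexts rounds down to 2^7 = 128 buckets */
#define SYNC_HT_BITS    ilog2(I915_NUM_ENGINES * SYNC_HT_CONTEXTS_PER_ENGINE)

struct timeline_sync_map {
        DECLARE_HASHTABLE(buckets, SYNC_HT_BITS);   /* hash_init() before use */
};
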
>>>
>>> Consider a long-running display server that will accumulate tens of
>>> thousands of clients in its lifetime, each with their own contexts
>>> that get shared by passing around fences/framebuffers. (Or, on a
>>> shorter scale, any of the context stress tests in igt.) Due to the
>>> non-recycling of the context ids, we can grow to very large tables -
>>> but we have no knowledge of which contexts are no longer required.
>>>
>>> To compensate we need to occasionally prune the sync points. For a ht
>>> we could just scrap it; for an idr, we could store the last use and
>>> delete stale leaves.
>>
>> Hm, pruning, yes... but you don't have pruning in this patch either.
>> So that's something which needs to be addressed either way.
>
> I know, review is great for hindsight. The iterator/remove for the
> compressed idr is going to be on the uglier side of the insertion. Ugly
> enough to be a separate patch.
>
>>> But first we have the question of how many buckets to give the static
>>> ht. Most processes will be sharing between 2 contexts (render context,
>>> presentation context), except for a display server which may have
>>> 10-100 clients - and that is possibly where eliminating repeated syncs
>>> is going to be most valuable. That suggests 256 buckets for every
>>> timeline (assuming just a pair of engines across shared contexts).
>>> Overkill for the majority, and going to be miserable in stress tests.
>>>
>>> Finally what do you store in the ht? Fences are the obvious candidate,
>>> but need reaping. Or you just create a ht for the request and only
>>> squash repeated fences inside a single request - that doesn't benefit
>>> from timeline tracking and squashing between requests (but does avoid
>>> keeping fences around forever). Hence why I went with tracking seqno. To
>>> avoid allocations for the ht nodes, we could create an open-addressed ht
>>> with the {context, seqno} embedded in it. It would be efficient, but
>>> needs online resizing and is a fair chunk of new code.
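
For illustration, such an open-addressed table with the {context, seqno}
pair embedded in each slot might look roughly like this; the names are
made up, and the resize and locking paths are omitted:

#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/types.h>

/* Key and value embedded directly in the slot; no per-node allocation. */
struct sync_slot {
        u64 context;    /* 0 means the slot is empty */
        u32 seqno;
};

struct sync_oaht {
        struct sync_slot *slots;
        unsigned int size;      /* power of two; needs online resizing */
};

/* Find (or claim) the slot for a context by linear probing. The resize
 * path, not shown, keeps the table from ever filling up. */
static u32 *sync_oaht_lookup(struct sync_oaht *t, u64 context)
{
        unsigned int i = hash_64(context, ilog2(t->size));

        while (t->slots[i].context && t->slots[i].context != context)
                i = (i + 1) & (t->size - 1);

        t->slots[i].context = context;
        return &t->slots[i].seqno;
}
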
>>
>> I was thinking of exactly the same thing as this patch does: u64
>> context id as the key, u32 seqnos (wrapped in a container with an
>> hlist_node).
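
Roughly, such a container could look like the sketch below, reusing the
timeline_sync_map declared earlier; the names are illustrative, and
insertion and locking are left out:

/* One entry per fence context: u64 context id -> last waited-upon seqno. */
struct sync_entry {
        struct hlist_node node;
        u64 context;
        u32 seqno;
};

static struct sync_entry *sync_map_find(struct timeline_sync_map *map,
                                        u64 context)
{
        struct sync_entry *e;

        hash_for_each_possible(map->buckets, e, node, context)
                if (e->context == context)
                        return e;

        return NULL;    /* a miss would kmalloc an entry and hash_add() it */
}
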
>
> Hmm, a hashed radixtree. I did read something about a hybrid approach.
>
>> Ok, so the key is pruning to keep the display server scenario in check.
>>
>> Free the key from i915_fence_release?
>
> Too early; it's the timeline (and the syncs along it) that is
> interesting. For our contexts, we can hook into context-close, but we
> still have some foreign dma-fence contexts to worry about. I was
> thinking of walking all timelines from the idle_worker, and possibly
> forcibly pruning across suspend.

Hm, I don't see why it is too early. If a request is getting freed,
there is no one waiting on it any longer, so how can it be OK to keep
that seqno in the map?

But yes, it sounds easy to do from the idle worker. Just walk everything
and prune an entry once the engine seqno has advanced past it?
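
Building on the hashtable sketch above, and assuming the driver's
existing i915_seqno_passed() wrap-safe comparison, that prune pass from
the idle worker could be as simple as:

#include <linux/slab.h>

/* Idle-worker sketch: drop every entry the engine has already passed. */
static void sync_map_prune(struct timeline_sync_map *map, u32 engine_seqno)
{
        struct sync_entry *e;
        struct hlist_node *tmp;
        int bkt;

        hash_for_each_safe(map->buckets, bkt, tmp, e, node) {
                if (i915_seqno_passed(engine_seqno, e->seqno)) {
                        hash_del(&e->node);
                        kfree(e);
                }
        }
}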

Regards,

Tvrtko

