[Intel-gfx] [PATCH 2/2] drm/i915/guc: default to using GuC submission where possible

Dave Gordon david.s.gordon at intel.com
Tue Apr 26 13:36:21 UTC 2016


On 26/04/16 11:35, Chris Wilson wrote:
> On Tue, Apr 26, 2016 at 10:52:41AM +0100, Dave Gordon wrote:
>> On 26/04/16 09:49, Dave Gordon wrote:
>>> On 25/04/16 11:39, Chris Wilson wrote:
>>>> On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
>>>>> On 22/04/16 19:45, Chris Wilson wrote:
>>
>> [snip]
>>
>>>>>> And what exactly is that atomic64_cmpxchg() serialising with? There are
>>>>>> no other CPUs contending with the write, and neither does the GuC
>>>>>> (and I
>>>>>> doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
>>>>>> a single WRITE_ONCE() of a 32-bit value would do wins the perf prize for
>>>>>> the hottest instruction and function in the kernel.
>>>>>
>>>>> The doorbell controller hardware, I should think. The BSpec
>>>>> describes using LOCK_CMPXCHG8B to update doorbells, so I think this
>>>>> code is just based on what it says there. If the CPU hardware
>>>>> doesn't implement it efficiently, surely the GPU h/w designers
>>>>> wouldn't have mandated it in this way?
>>>>
>>>> Wow, I'm surprised that they would put them into the same domain. Still,
>>>> unless you are actually serialising with another writer, what is the
>>>> point of using lock cmpxchg? E.g. an xchg would be enough to enforce
>>>> ordering, and you should ask them again if this is not a little overkill
>>>> for one-way signaling.
>>>> -Chris
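
For concreteness, here's a rough sketch of the two styles of doorbell
update being discussed. The union layout, field names and the ENABLED
value below are illustrative stand-ins, not the driver's actual
definitions:

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/types.h>

#define DB_ENABLED	1	/* illustrative; stands in for the real status value */

/* Hypothetical doorbell layout: status + cookie packed into one qword */
union db_qw {
	struct {
		u32 db_status;
		u32 cookie;
	};
	u64 value_qw;
};

/* BSpec-style update: LOCK CMPXCHG8B on the whole qword, so the status
 * check and the cookie bump happen as one atomic operation. */
static void ring_db_cmpxchg(union db_qw *db, u32 old_cookie)
{
	union db_qw cmp, exc;

	cmp.db_status = DB_ENABLED;
	cmp.cookie = old_cookie;
	exc.db_status = DB_ENABLED;
	exc.cookie = old_cookie + 1;

	atomic64_cmpxchg((atomic64_t *)db, cmp.value_qw, exc.value_qw);
}

/* Chris's suggestion: if no other CPU writer contends for the qword,
 * a single ordered 32-bit store of the new cookie would be enough. */
static void ring_db_write_once(union db_qw *db, u32 old_cookie)
{
	WRITE_ONCE(db->cookie, old_cookie + 1);
}

The cmpxchg version buys an atomic update of the status+cookie pair, which
is what the BSpec recipe describes; the WRITE_ONCE version relies on there
being no concurrent CPU writer to serialise against.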
>>
>> As for performance, while LOCK_CMPXCHG8B might be an expensive
>> instruction, we're only executing ONE per request. I suspect that
>> the cumulative cost of all the extra memory accesses caused by extra
>> indirections and poor structure layout costs far more than any single
>> instruction ever could.
>>
>> Top things in this area might be:
>>
>> * all the macros taking "dev" instead of "dev_priv"
>> * pointer dances in general (a->b->c.d->e) where we could add a
>> shortcut pointer from a to c (or c.d), or from a or b to e.
>> * way too much repetition of a->b->c, a->b->d, a->b->e in the same
>> function, which the compiler *may* optimise, but probably won't if
>> there are any function calls around. Adding a local for a->b will
>> almost certainly help, or at least incur no penalty and be easier to
>> read (see the sketch below).
>> * awkwardly sized or misaligned structure members, and bitfield
>> bools rather than 1-byte flags
>>
>> So let's nibble away at these before we worry about the cost of a
>> single x86 instruction!
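
Here's the kind of thing I mean by adding a local for a->b. This is a
minimal sketch with made-up structures and field names, not code from
the driver:

#include <linux/types.h>

/* Hypothetical structures standing in for the driver's a->b->c chains */
struct wq_state {
	u32 tail;
	u32 size;		/* power of two */
	u32 submissions;
};

struct client {
	struct wq_state wq;
};

struct request {
	struct client *client;
	u32 item_size;
};

/* Before: the same a->b prefix is chased on every line; the compiler
 * may CSE it, but probably won't across intervening function calls. */
static void add_item_chatty(struct request *rq)
{
	rq->client->wq.submissions++;
	rq->client->wq.tail += rq->item_size;
	rq->client->wq.tail &= rq->client->wq.size - 1;
}

/* After: hoist the common prefix into a local -- at worst no penalty,
 * usually fewer loads, and certainly easier to read. */
static void add_item_tidy(struct request *rq)
{
	struct wq_state *wq = &rq->client->wq;

	wq->submissions++;
	wq->tail += rq->item_size;
	wq->tail &= wq->size - 1;
}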
>
> You can either assume that I applied the patches I sent to the ml for
> the above points, or just look at the delta between execlists and guc
> and worry about the regressions.
> -Chris

With the latest version of this patchset (not yet posted), I see GuC 
submission just slightly *faster* than execlists on the render ring :) 
However, it's still slower on the other rings, where the minimum execlist 
time is lower.

Just running no-op batches, execlist submission on the render ring shows a 
downward trend in per-batch cost as more batches are submitted per cycle, 
flattening out to a limit of just over 3us/batch. GuC mode shows the same 
trend, but bottoms out at just *under* 3us/batch. The crossover seems to be 
at 512-1024 batches/cycle, where the reduced interrupt overhead outweighs 
the added submission latency.

It probably helps that we're no longer mapping & unmapping the doorbell 
page on each request!
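
Roughly, the shape of that change is the following; the structure,
field and function names are purely illustrative, not the actual patch:

#include <linux/compiler.h>
#include <linux/highmem.h>
#include <linux/types.h>

/* Hypothetical client state; not the driver's real struct */
struct db_client {
	struct page *db_page;	/* page backing the doorbell */
	void *db_vaddr;		/* persistent mapping, once set up */
	u32 db_offset;		/* doorbell offset within the page */
};

/* Before: map and unmap the page around every single doorbell write */
static void ring_db_transient(struct db_client *client, u32 cookie)
{
	void *vaddr = kmap_atomic(client->db_page);
	u32 *db = vaddr + client->db_offset;

	WRITE_ONCE(*db, cookie);
	kunmap_atomic(vaddr);
}

/* After: map the page once when the client is created ... */
static void client_map_db(struct db_client *client)
{
	client->db_vaddr = kmap(client->db_page);
}

/* ... and each request just writes through the persistent mapping
 * (the matching kunmap() happens at client teardown). */
static void ring_db_persistent(struct db_client *client, u32 cookie)
{
	u32 *db = client->db_vaddr + client->db_offset;

	WRITE_ONCE(*db, cookie);
}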

.Dave.

