[Intel-gfx] [PATCH 2/2] drm/i915/guc: default to using GuC submission where possible
Dave Gordon
david.s.gordon at intel.com
Tue Apr 26 09:52:41 UTC 2016
On 26/04/16 09:49, Dave Gordon wrote:
> On 25/04/16 11:39, Chris Wilson wrote:
>> On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
>>> On 22/04/16 19:45, Chris Wilson wrote:
[snip]
>>>> And what exactly is that atomic64_cmpxchg() serialising with? There are
>>>> no other CPUs contending with the write, and neither does the GuC
>>>> (and I
>>>> doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
>>>> a single WRITE_ONCE() of a 32bit value wins the perf prize for hotest
>>>> instruction and function in the kernel.
>>>
>>> The doorbell controller hardware, I should think. The BSpec
>>> describes using LOCK_CMPXCHG8B to update doorbells, so I think this
>>> code is just based on what it says there. If the CPU hardware
>>> doesn't implement it efficiently, surely the GPU h/w designers
>>> wouldn't have mandated it in this way?
>>
>> Wow, I'm surprised that they would put into the same domain. Still,
>> unless you are actually serialising with another writer, what is the
>> point of using lock cmpxchg? E.g. an xchg would be enough to enforce
>> ordering, and you should ask them again if this is not a little overkill
>> for one-way signaling.
>> -Chris
As for performance, while LOCK_CMPXCHG8B might be an expensive
instruction, we're only executing ONE per request. I suspect that the
cumulative cost of all the extra memory accesses caused by extra
indirections and poor structure layout cost far more than any single
instruction ever can.
Top things in this area might be:
* all the macros taking "dev" instead of "dev_priv"
* pointer dances in general (a->b->c.d->e) where we could add a shortcut
pointer from a to c (or c.d), or from a or b to e.
* way too much repetition of a->b->c, a->b->d, a->b->e in the same
function, which the compiler *may* optimise, but probably won't if there
are any function calls around. Adding a local for a->b will almost
certainly help, or at least incur no penalty and be easier to read.
* awkwardly sized or misaligned structure members, and bitfield bools
rather than 1-byte flags
So let's nibble away at these before we worry about the cost of a single
x86 instruction!
.Dave.
More information about the Intel-gfx
mailing list