[Intel-gfx] [PATCH 2/2] drm/i915/guc: default to using GuC submission where possible

Tue Apr 26 10:35:33 UTC 2016

On Tue, Apr 26, 2016 at 10:52:41AM +0100, Dave Gordon wrote:
> On 26/04/16 09:49, Dave Gordon wrote:
> >On 25/04/16 11:39, Chris Wilson wrote:
> >>On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
> >>>On 22/04/16 19:45, Chris Wilson wrote:
> 
> [snip]
> 
> >>>>And what exactly is that atomic64_cmpxchg() serialising with? There are
> >>>>no other CPUs contending with the write, and neither is the GuC (and I
> >>>>doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
> >>>>a single WRITE_ONCE() of a 32bit value would do wins the perf prize for
> >>>>hottest instruction and function in the kernel.
> >>>
> >>>The doorbell controller hardware, I should think. The BSpec
> >>>describes using LOCK_CMPXCHG8B to update doorbells, so I think this
> >>>code is just based on what it says there. If the CPU hardware
> >>>doesn't implement it efficiently, surely the GPU h/w designers
> >>>wouldn't have mandated it in this way?
> >>
> >>Wow, I'm surprised that they would put it into the same domain. Still,
> >>unless you are actually serialising with another writer, what is the
> >>point of using lock cmpxchg? E.g. an xchg would be enough to enforce
> >>ordering, and you should ask them again if this is not a little overkill
> >>for one-way signaling.
> >>-Chris
> 
> As for performance, while LOCK_CMPXCHG8B might be an expensive
> instruction, we're only executing ONE per request. I suspect that
> all the extra memory accesses caused by extra indirections and poor
> structure layout cost far more in total than any single instruction
> ever can.
> 
> Top things in this area might be:
> 
> * all the macros taking "dev" instead of "dev_priv"
> * pointer dances in general (a->b->c.d->e) where we could add a
> shortcut pointer from a to c (or c.d), or from a or b to e.
> * way too much repetition of a->b->c, a->b->d, a->b->e in the same
> function, which the compiler *may* optimise, but probably won't if
> there are any function calls around. Adding a local for a->b will
> almost certainly help, or at least incur no penalty and be easier to
> read.
> * awkwardly sized or misaligned structure members, and bitfield
> bools rather than 1-byte flags
> 
> So let's nibble away at these before we worry about the cost of a
> single x86 instruction!
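
To make the "local for a->b" point in the list above concrete, here is a
minimal sketch in plain C. The struct and function names are hypothetical
and not taken from i915; the opaque-looking call in the middle stands in
for "any function calls around", after which a compiler may have to
reload the pointer chain unless it can prove nothing aliased it.

```c
#include <stdint.h>

/* Hypothetical layout, for illustration only (not the i915 structures). */
struct engine_stats {
	uint32_t submitted;
	uint32_t completed;
	uint32_t pending;
};

struct device_priv {
	struct engine_stats stats;
};

struct device {
	struct device_priv *priv;
};

static unsigned int events_logged;

/* Stand-in for a call the compiler may not be able to see through. */
void log_event(void)
{
	events_logged++;
}

/* Walks dev->priv->stats again at every use. */
uint32_t pending_chained(struct device *dev)
{
	dev->priv->stats.submitted++;
	log_event();
	dev->priv->stats.pending =
		dev->priv->stats.submitted - dev->priv->stats.completed;
	return dev->priv->stats.pending;
}

/* Resolves the chain once and keeps a local shortcut pointer. */
uint32_t pending_local(struct device *dev)
{
	struct engine_stats *stats = &dev->priv->stats;

	stats->submitted++;
	log_event();
	stats->pending = stats->submitted - stats->completed;
	return stats->pending;
}
```

Both functions compute the same result; the second only removes the
repeated dereferences and is arguably easier to read. Whether it is
measurably faster depends on the compiler and the surrounding code.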

You can either assume that I applied the patches I sent to the ml for
the above points, or just look at the delta between execlists and guc
and worry about the regressions.
-Chris
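
For reference, the three kinds of write debated earlier in this thread
(lock cmpxchg, xchg, and a plain once-only store) can be sketched in
userspace C11 atomics. This is an illustration under assumptions, not
the driver code: the doorbell layout (status in the low 32 bits, cookie
in the high 32 bits) and all function names are hypothetical, and the
C11 calls are only rough analogues of the kernel's atomic64_cmpxchg(),
xchg() and WRITE_ONCE().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing: status in bits 0-31, cookie in bits 32-63. */
static inline uint64_t db_pack(uint32_t status, uint32_t cookie)
{
	return ((uint64_t)cookie << 32) | status;
}

/*
 * (1) Compare-and-exchange: the write happens only if the doorbell still
 * holds 'expected'. This only buys anything when another writer may be
 * racing with us. On x86-64 this is typically a lock cmpxchg.
 */
static inline bool db_ring_cmpxchg(_Atomic uint64_t *db,
				   uint64_t expected, uint64_t desired)
{
	return atomic_compare_exchange_strong(db, &expected, desired);
}

/*
 * (2) Unconditional exchange: always writes, is still sequentially
 * consistent, and returns the previous value. On x86 this is typically
 * an xchg, which is implicitly locked.
 */
static inline uint64_t db_ring_xchg(_Atomic uint64_t *db, uint64_t desired)
{
	return atomic_exchange(db, desired);
}

/*
 * (3) Plain 32-bit store with no ordering guarantees beyond atomicity of
 * the store itself; the closest C11 analogue of WRITE_ONCE().
 */
static inline void db_ring_store(_Atomic uint32_t *cookie, uint32_t value)
{
	atomic_store_explicit(cookie, value, memory_order_relaxed);
}
```

With a single writer, (1) and (2) leave the doorbell in the same state;
the difference is the cost of the compare and the retry logic that (1)
implies. Which of these the doorbell hardware actually requires is a
question for the BSpec, not something this sketch can answer.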

-- 
Chris Wilson, Intel Open Source Technology Centre