[Intel-gfx] [PATCH] drm/i915: Replace some more busy waits with normal ones

Thu Mar 24 12:27:31 UTC 2016

On Thu, Mar 24, 2016 at 11:37:07AM +0000, Tvrtko Ursulin wrote:
> 
> On 23/03/16 16:40, Chris Wilson wrote:
> >On Wed, Mar 23, 2016 at 04:24:48PM +0000, Tvrtko Ursulin wrote:
> >>Biggest thing to make sure is that you don't add a lot of cycles to
> >>the forcewake loops since for example fw_domains_get can be the
> >>hottest i915 function on some benchmarks.
> >>
> >>(This area slightly annoys me anyway with redundant looping over
> >>forcewake domains and we could also potentially optimize the ack
> >>waiting by first requesting all we want, and then doing the waits.
> >>That would be one additional loop, but if we removed the other one,
> >>the code would stay at the same number of domain loops.)
> >
> >I hear you. I just end up weeping in the corner when I see fw_domain_get
> >on the profile.
> >
> >We already do have a mitigation scheme to hold onto the forcewake for an
> >extra jiffie every time. I don't like it, but without it fw_domains_get
> >becomes a real hog.
> 
> I am pretty sure I've seen some tests which somehow defeat the
> jiffie delay and we end up re-acquiring every ms/jiffie. This is
> something I wanted to get to the bottom of but did not get round to
> yet. It was totally unexpected because the test is hammering on
> everything.

Are you absolutely sure it is not just the delay in acquiring the ack,
plus the fact that spinning on thread_c0 doesn't come cheap? I've just
written off fw_domain_get being high on the profiles as simply a
consequence of how long we have to spin (I'm jaded because on
Sandybridge spinning for 50us+ isn't uncommon iirc).
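
If the ack latency really is the dominant cost, the request-then-wait
ordering suggested earlier in the thread should help whenever more than
one domain is woken. A rough way to see that is the toy userspace model
below. It is not driver code: struct fw_domain_model, the poll counter
"now" and the fixed per-domain ack latencies are all assumptions made up
for illustration.

```c
#include <stdint.h>

#define NUM_DOMAINS 3

/* Hypothetical model: a domain acks a fixed number of polls after its request. */
struct fw_domain_model {
	unsigned int ack_latency;  /* polls needed before the ack shows up */
	unsigned int requested_at; /* value of 'now' when the request was issued */
};

static unsigned int now; /* poll counter standing in for elapsed time */

static void fw_request(struct fw_domain_model *d)
{
	d->requested_at = now;
}

static void fw_wait_ack(const struct fw_domain_model *d)
{
	while (now - d->requested_at < d->ack_latency)
		now++; /* one poll of the (modelled) ack register */
}

/* Request and wait one domain at a time: the latencies add up. */
static unsigned int get_serial(struct fw_domain_model *d, unsigned int mask)
{
	now = 0;
	for (int i = 0; i < NUM_DOMAINS; i++) {
		if (!(mask & (1u << i)))
			continue;
		fw_request(&d[i]);
		fw_wait_ack(&d[i]);
	}
	return now;
}

/* Request everything first, then wait: the latencies overlap. */
static unsigned int get_batched(struct fw_domain_model *d, unsigned int mask)
{
	now = 0;
	for (int i = 0; i < NUM_DOMAINS; i++)
		if (mask & (1u << i))
			fw_request(&d[i]);
	for (int i = 0; i < NUM_DOMAINS; i++)
		if (mask & (1u << i))
			fw_wait_ack(&d[i]);
	return now;
}
```

In this model the serial variant costs the sum of the ack latencies and
the batched variant costs only the largest one. With a single domain in
the mask the two are identical, so the gain depends entirely on how often
we wake more than one domain at once.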

You're right though, we should instrument it and check it is working.
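
Something along these lines is what I have in mind for the
instrumentation, sketched as a standalone model and not as a patch. The
struct, the tick argument and the two counters (hw_acquires, cached_hits)
are hypothetical names; the real code uses a timer and a wake count, and
the details here are simplified assumptions.

```c
/* Hypothetical model of one forcewake domain with a deferred release. */
struct fw_model {
	unsigned long wake_count;  /* references currently held by callers */
	int hw_awake;              /* is the (modelled) hardware wake asserted? */
	unsigned long release_at;  /* tick at which the deferred release fires */
	unsigned long hw_acquires; /* instrumentation: real wakeups performed */
	unsigned long cached_hits; /* instrumentation: gets served by the held wake */
};

/* Deferred release: drop the hardware wake only when idle and expired. */
static void fw_tick(struct fw_model *fw, unsigned long now)
{
	if (fw->hw_awake && fw->wake_count == 0 && now >= fw->release_at)
		fw->hw_awake = 0;
}

static void fw_get(struct fw_model *fw, unsigned long now)
{
	fw_tick(fw, now);
	if (fw->wake_count++ == 0) {
		if (fw->hw_awake) {
			fw->cached_hits++; /* the extra hold paid off */
		} else {
			fw->hw_awake = 1;  /* this is where the ack spin would be */
			fw->hw_acquires++;
		}
	}
}

static void fw_put(struct fw_model *fw, unsigned long now)
{
	if (--fw->wake_count == 0)
		fw->release_at = now + 1; /* hold on for one extra tick */
}
```

If hw_acquires climbs at roughly one per tick while the test is
hammering on everything, the hold is being defeated. If cached_hits
dominates and the function is still high on the profile, then the time
is going into the ack spin itself.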

> >Note that one thing we can actually do is restrict the domains we wakeup
> >for the engines (engine->fw_domain) in execlists_submit, that should
> >help chv/skl+ a small amount.
> 
> I even have a patch to do that somewhere. :)

So did I!

> >I don't have a good idea for how to keep rc6 residency high but avoid
> >forcewake when those darn elsp require forcewake. As does gen6+ legacy
> >RING_TAIL writes. And even then that spinlock causes quite a bit of
> >traffic when it shouldn't be contended. I've been thinking of whether we
> >can have multiple locks (hashed by register) but we would then still
> >need some cross-communication for the common forcewake.
> 
> Maybe it is not worth it at this point. This is pretty well
> optimised now and we could switch to the next target. Like maybe move
> to active and retire__read, or retired_req_list, or something.

There is a challenge in that we are both lazy in how often we check for
retirement/idleness (because it's work that we can postpone and if we do
so, quite often that work is no longer required!) and that just because
the driver believes the GPU should be busy, it can quite happily power
itself down between operations (though that's really for gen6/gen7
semaphores with the current state of the driver).

I don't have a better plan than to reduce the mmio and spinlocks where
possible.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre