[Intel-gfx] [PATCH 7/8] drm/i915/pmu: Wire up engine busy stats to PMU

Chris Wilson chris at chris-wilson.co.uk
Tue Sep 26 20:01:34 UTC 2017


Quoting Rogozhkin, Dmitry V (2017-09-26 19:46:48)
> On Tue, 2017-09-26 at 13:32 +0100, Tvrtko Ursulin wrote:
> > On 25/09/2017 18:48, Chris Wilson wrote:
> > > Quoting Tvrtko Ursulin (2017-09-25 16:15:42)
> > >> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> > >>
> > >> We can use engine busy stats instead of the MMIO sampling timer
> > >> for better efficiency.
> > >>
> > >> As a minimum this saves num_engines MMIO reads per sampling period,
> > >> and in the best case, when only engine busy samplers are active,
> > >> it lets us avoid starting the sampling timer at all.
> > > 
> > > Or you could inspect port_isset(execlists.port).
> > > You can avoid the mmio for this case also by just using HWSP. It's just
> > > that I never enabled busy tracking in isolation, so I always ended up
> > > using the mmio.
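
To be concrete, the no-mmio check I had in mind is roughly the following
(a sketch against the execlists/HWSP helpers as they stood at the time --
port_isset(), intel_engine_get_seqno(), intel_engine_last_submit(); the
wrapper name itself is made up, not from any patch):

	static bool engine_busy_no_mmio(struct intel_engine_cs *engine)
	{
		/* Busy if a context is currently submitted to an ELSP port. */
		if (port_isset(engine->execlists.port))
			return true;

		/*
		 * Or, using only the HWSP: if the last submitted seqno has
		 * not yet completed, the engine still has work outstanding.
		 */
		return !i915_seqno_passed(intel_engine_get_seqno(engine),
					  intel_engine_last_submit(engine));
	}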
> > 
> > This would be for execlists only. I could change the main patch to do 
> > this, you think it is worth it?
> 
> You know, I wonder why we limit this to execlists?

Because there's only one ringbuffer for legacy and we don't virtualise
the contexts to have distinct buffers onto which we build the different
execlists. It's not impossible to emulate execlists on top; the only win
would be to remove the semaphore inter-engine waits (and external waits)
and replace them with CPU waits, unblocking the pipeline. Reducing
ringbuffer performance to execlist performance isn't appealing, however.

> Is that because the scheduler works only for execlists and doesn't work
> for ringbuffers on older HW? But consider the following. If we don't
> have a scheduler, then we have a FIFO queue and either a hw semaphore or
> sw sync. For a userspace application, real execution and waiting are not
> actually distinguishable: to it the engine is busy, since it is either
> executing or stalled and unable to execute anything else; either way, it
> is busy.

I've used the sema stall notification to reduce said stalls and allow
greater parallelism, knowing that they are caused by inter-engine
dependencies.

The question is whether knowing e.g. global statistics on the number of
requests waiting for external fences is interesting, or just the
runnable queue length. For the application, the question is more or less
what is the length of the dependency chain for this batch -- how early
can I expect it to run? Such questions will vary based on the scheduler
policy.

> From this perspective we could consider extending what we currently do
> for execlists to cover FIFO ringbuffers.

You mean how to extend the ringbuffer counters to cover execlists, and
beyond.

> What do you think? Other metrics like SEMA or WAIT would be a second
> level of detail, describing the distribution of BUSY time: we spent SEMA
> time on a semaphore or WAIT time on the wait, and actively ran something
> for BUSY - SEMA (or WAIT) time.

That's what they mean currently (and how I tried to display them in
overlay).
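
Spelled out with made-up variable names (illustrative only, nothing from
the driver), the accounting is simply:

	/*
	 * BUSY is the wall time the engine had work outstanding; SEMA and
	 * WAIT are the portions of that time spent stalled, so the
	 * actively-executing share is whatever remains.
	 */
	u64 active_ns = busy_ns - sema_ns - wait_ns;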
 
> By the way, with BDW+ and execlists, is my understanding right that we
> report the WAIT metric and SEMA is always 0?

WAIT is still used under execlists. For SEMA, I'm hoping that
MI_SEMA_WAIT | POLL results in the RING_CTL bit being asserted; that will
be useful for Vulkan, for example.
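
For reference, the sampling-timer side derives both of those from the
RING_CTL wait bits, roughly like this (a sketch, not the exact patch;
each timer tick then credits the sampling period to the matching
counter):

	u32 val = I915_READ_FW(RING_CTL(engine->mmio_base));

	/* CS stalled on MI_WAIT_FOR_EVENT -> accumulate into WAIT. */
	bool waiting = val & RING_WAIT;

	/* CS stalled on a semaphore (gen6+) -> accumulate into SEMA. */
	bool semaphore = val & RING_WAIT_SEMAPHORE;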
-Chris

