[Intel-gfx] [RFC v2 0/6] Queued/runnable/running engine stats
Chris Wilson
chris at chris-wilson.co.uk
Wed Jan 24 18:07:55 UTC 2018
Quoting Tvrtko Ursulin (2018-01-24 18:01:14)
>
> On 22/01/2018 18:52, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-01-22 18:43:52)
> >> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>
> >> Per-engine queue depths are an interesting metric for analyzing the system load
> >> and also for users who wish to use it to load balance their submissions based
> >> on it.
> >>
> >> In this version I have split the metrics into three separate counters:
> >>
> >> 1. QUEUED - From execbuf time until the request becomes runnable, i.e. until
> >> its dependencies have been resolved and fences signaled.
> >> 2. RUNNABLE - From runnable to running on the GPU.
> >> 3. RUNNING - Running on the GPU.
> >>
> >> When inspected with perf stat the output looks roughly like this:
> >>
> >> #           time   counts unit events
> >>    201.160490145     0.01      i915/rcs0-queued/
> >>    201.160490145    19.13      i915/rcs0-runnable/
> >>    201.160490145     2.39      i915/rcs0-running/
> >>
> >> The reported numbers are average queue depths for the last query period.
> >>
> >> Having split out metrics should be more flexible for all users, and it is still
> >> possible to fetch an atomic snapshot of all using the perf groups for those
> >> wanting to combine them.
> >>
> >> For users wanting instantaneous numbers instead of averaged, we could potentially
> >> expose them using the query API Lionel is working on.
> >> (https://patchwork.freedesktop.org/series/36622/)
> >>
> >> For instance a query packet could look like:
> >>
> >> #define DRM_I915_QUERY_ENGINE_QUEUES 0x04
> >>
> >> struct drm_i915_query_engine_queues {
> >> 	__u8 class;
> >> 	__u8 instance;
> >>
> >> 	__u8 pad[2];
> >>
> >> 	__u32 queued;
> >> 	__u32 runnable;
> >> 	__u32 running;
> >> };
> >>
> >> I also have patches to expose this via intel-gpu-top, using the perf API.
> >
> > Can you stick a ewma loadavg just after the hostname in intel-gpu-overlay,
> > pretty please? :)
>
> Sure, just one period and all three counters aggregated?
Hmm, just runnable + running I think matches loadavg best. (For the cpu
that is the number of tasks in the runqueue.) I think having the 1s,
30s, 15m figures would be useful but they can be computed in userspace
from the single (combined) sampler.
But the problem with just runnable + running is that we don't see those
inter-engine dependencies so clearly (but it does hide inter-device waits
etc), so I don't know.
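
As a rough illustration of computing those figures in userspace, here is a
minimal sketch that feeds one combined (runnable + running) sample into three
EWMAs with time constants of 1s, 30s and 15m. All names (gpu_loadavg,
gpu_loadavg_init, gpu_loadavg_update) are hypothetical and not part of i915
or intel-gpu-tools. The smoothing factor dt / (tau + dt) is an assumption: it
is the backward-Euler form of a first-order low-pass filter, used here in
place of 1 - exp(-dt / tau) to avoid depending on libm.

```c
#include <stddef.h>

/* Hypothetical helper, not an existing i915 or intel-gpu-tools API. */
#define GPU_LOADAVG_WINDOWS 3

struct gpu_loadavg {
	double tau[GPU_LOADAVG_WINDOWS]; /* time constants in seconds */
	double avg[GPU_LOADAVG_WINDOWS]; /* current EWMA for each window */
};

/* Start all averages at zero with 1s, 30s and 15m time constants. */
static void gpu_loadavg_init(struct gpu_loadavg *la)
{
	static const double tau[GPU_LOADAVG_WINDOWS] = { 1.0, 30.0, 900.0 };
	size_t i;

	for (i = 0; i < GPU_LOADAVG_WINDOWS; i++) {
		la->tau[i] = tau[i];
		la->avg[i] = 0.0;
	}
}

/*
 * Fold one sample into every window.
 *
 * runnable, running: average queue depths reported by the sampler for
 *                    the last period (e.g. rcs0-runnable, rcs0-running).
 * dt:                length of that period in seconds.
 */
static void gpu_loadavg_update(struct gpu_loadavg *la,
			       double runnable, double running, double dt)
{
	double sample = runnable + running;
	size_t i;

	if (dt <= 0.0)
		return; /* no time elapsed, nothing to fold in */

	for (i = 0; i < GPU_LOADAVG_WINDOWS; i++) {
		/* Approximation of 1 - exp(-dt / tau), see note above. */
		double alpha = dt / (la->tau[i] + dt);

		la->avg[i] += alpha * (sample - la->avg[i]);
	}
}
```

The caller would invoke gpu_loadavg_update() once per sampling period with
the two counters read from perf, then print avg[0..2] after the hostname.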
-Chris