[RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
Peter Zijlstra
peterz at infradead.org
Wed May 27 08:39:14 PDT 2015
On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
> >
> > So for me the 'natural' way to represent this in perf would be through
> > event groups. Create a perf event for every single event -- yes this is
> > 53 events.
>
> So when I was first looking at this work I had considered the
> possibility of separate events, and these are some of the things that
> in the end made me forward the hardware's raw report data via a single
> event instead...
>
> There are 100s of possible B counters depending on the MUX
> configuration + fixed-function logic in addition to the A counters.
There are only 8 B counters; what there are 100s of is configurations, and
that's no different from any other PMU. There are 100s of events to select
on the CPU side as well -- yet only 4 (general purpose) counters (8 if
you disable HT on recent machines) per CPU.
Of course, you could create more than 8 events at any one time, and then
perf will round-robin the configuration for you.
> A
> choice would need to be made about whether to expose events for the
> configurable counters that aren't inherently associated with any
> semantics, or instead defining events for counters with specific
> semantics (with 100s of possible counters to define). The latter would
> seem more complex for userspace and the kernel if they both now have
> to understand the constraints on what counters can be used together.
Again, no different from existing PMUs. The most 'interesting' part of
any PMU driver is event scheduling.
The explicit design of perf was to put event scheduling _in_ the kernel
and not allow userspace direct access to the hardware (as previous PMU
models did).
> I
> guess with either approach we would also need to have some form of
> dedicated group leader event accepting attributes for configuring the
> state that affects the group as a whole, such as the counter
> configuration (3D vs GPGPU vs media etc).
That depends a bit on how flexible you want to switch between these
modes; with the cost being in the 100s of MMIO register writes, it might
just not make sense to make this too dynamic.
An option might be to select the mode through the PMU driver's sysfs
files. Another option might be to expose the B counters 3 times and have
each set be mutually exclusive. That is, as soon as you've created a 3D
event, you can no longer create GPGPU/Media events.
> I'm not sure where we would
> handle the context-id + drm file descriptor attributes for initiating
> single context profiling but guess we'd need to authenticate each
> individual event open.
Event open is a slow path, so authenticating each individual open should
not be a problem, right?
> It's not clear if we'd configure the report
> layout via the group leader, or try to automatically choose the most
> compact format based on the group members. I'm not sure how pmus
> currently handle the opening of enabled events on an enabled group but
That would result in the PERF_FORMAT_GROUP layout changing in-flight. I
would recommend not doing that -- decoding the output would be
'interesting', though not impossible.
> I think there would need to be limitations in our case that new
> members can't result in a reconfigure of the counters if that might
> lose the current counter values known to userspace.
I'm not entirely sure what you mean here.
> From a user's pov, there's no real freedom to mix and match which
> counters are configured together, and there's only some limited
> ability to ignore some of the currently selected counters by not
> including them in reports.
It is 'impossible' to create a group that is not programmable. That is,
the pmu::event_init() call _should_ verify that the addition of the
event to the group (as given by event->group_leader) is valid.
> Something to understand here is that we have to work with sets of
> pre-defined MUX + fixed-function logic configurations that have been
> validated to give useful metrics for specific use cases, such as
> benchmarking 3D rendering, GPGPU or media workloads.
This is fine; as stated above. Since these are limited pieces of
'firmware' which are expensive to load, you don't have to do a fully
dynamic solution here.
> As it is currently the kernel doesn't need to know anything about the
> semantics of individual counters being selected, so it's currently
> convenient that we can aim to maintain all the counter meta data we
> have in userspace according to the changing needs of tools or drivers
> (e.g. names, descriptions, units, max values, normalization
> equations), de-coupled from the kernel, instead of splitting it
> between the kernel and userspace.
And that fully violates the design premise of perf. The kernel should be
in control of resources, not userspace.
If we'd put userspace in charge, we could not profile the same task by
two (or more) different observers at the same time. But with kernel side
counter management that is no problem at all.
> A benefit of being able to change the report size is to reduce memory
> bandwidth usage that can skew measurements. It's possible to request
> the gpu to write out periodic snapshots at a very high frequency (we
> can program a period as low as 160 nanoseconds) and higher frequencies
> can start to expose some interesting details about how the gpu is
> utilized - though with notable observer effects too. How careful we
> are to not waste bandwidth is expected to determine what sampling
> resolutions we can achieve before significantly impacting what we are
> measuring.
>
> Splitting the counters up looked like it could increase the bandwidth
> we use quite a bit. The main difference comes from requiring 64bit
> values instead of the 32bit values in our raw reports. This can be
> offset partly since there are quite a few 'reserved'/redundant A
> counters that don't need forwarding. As an example in the most extreme
> case, instead of an 8 byte perf_event_header + 4byte raw_size + 256
> byte reports + 4 byte padding every 160ns ~= 1.5GB/s, we might have 33
> A counters (ignoring redundant ones) + 16 configurable counters = 400
In that PDF there's only 8 configurable 'B' counters.
> byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
> perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
> could choose to forward only 2 or 3 counters of interest at these high
> frequencies which isn't possible currently.
Right; although you'd have to spend some cpu cycles on the format shift.
Which might make it moot again. That said; I would really prefer we
start out with trying to make the generic format stuff work before
trying to come up with special case hacks.
> > Use the MMIO reads for the regular read() interface, and use a hrtimer
> > placing MI_REPORT_PERF_COUNT commands, with a counter select mask
> > covering all the events in the current group, for sampling.
>
> Unfortunately due to the mmio limitations and the need to relate
> counters I can't imagine many use cases for directly accessing the
> counters individually via the read() interface.
Fair enough.
> MI_REPORT_PERF_COUNT commands are really only intended for collecting
> reports in sync with a command stream. We are experimenting currently
> with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
> commands automatically around the batches of commands submitted by
> userspace so we can do a better job of filtering metrics across many
> gpu contexts, but for now the expectation is that the kernel shouldn't
> be emitting MI_REPORT_PERF_COUNT commands. We emit
> MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
> GL_INTEL_performance_query extension, at the start and end of a query
> around a sequence of commands that the application is interested in
> measuring.
I will try and read up on the GL_INTEL_performance_query thing.
> > You can use the perf_event_attr::config to select the counter (A0-A44,
> > B0-B7) and use perf_event_attr::config1 (low and high dword) for the
> > corresponding CEC registers.
> >
>
> Hopefully covered above, but since the fixed-function state is so
> dependent on the MUX configuration I think it currently makes sense to
> treat the MUX plus logic state (including the CEC state) as a tightly
> coupled unit.
Oh wait, the A counters are also affected by the MUX programming!
That wasn't clear to me before this.
Same difference though; you could still do either the sysfs aided flips
or the mutually exclusive counter types to deal with this; all still
assuming reprogramming all that is 'expensive'.
> The Flexible EU Counters for Broadwell+ could be more amenable to this
> kind of independent configuration, as I don't believe they are
> dependent on the MUX configuration.
That sounds good :-)
> One idea that's come up a lot though is having the possibility of
> being able to configure an event with a full MUX + fixed-function
> state description.
Right; seeing how there's only a very limited number of these MUX
programs validated (3D, GPGPU, Media, any others?) this should be
doable. And as mentioned before, you could disable the other types once
you create one in order to limit the reprogramming thing.
> >
> > This does not require random per driver ABI extensions for
> > perf_event_attr, nor your custom output format.
> >
> > Am I missing something obvious here?
>
> Definitely nothing 'obvious' since the current documentation is
> notably incomplete a.t.m, but I don't think we were on the same page
> about how the hardware works and our use cases.
>
> Hopefully some of my above comments help clarify some details.
Yes, thanks!