[Intel-gfx] [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
Peter Zijlstra
peterz at infradead.org
Tue May 19 07:53:37 PDT 2015
On Fri, May 15, 2015 at 02:07:29AM +0100, Robert Bragg wrote:
> On Fri, May 8, 2015 at 5:24 PM, Peter Zijlstra <peterz at infradead.org> wrote:
> > On Thu, May 07, 2015 at 03:15:43PM +0100, Robert Bragg wrote:
> >
> >> I've changed the uapi for configuring the i915_oa specific attributes
> >> when calling perf_event_open(2) whereby instead of cramming lots of
> >> bitfields into the perf_event_attr config members, I'm now
> >> daisy-chaining a drm_i915_oa_event_attr_t structure off of a single
> >> config member that's extensible and validated in the same way as the
> >> perf_event_attr struct. I've found this much nicer to work with while
> >> being neatly extensible too.
> >
> > This worries me a bit.. is there more background for this?
>
> Would it maybe be helpful to see the before and after? I had kept this
> uapi change in a separate patch for a while locally but in the end
> decided to squash it before sending out my updated series.
>
> Although I did find it a bit awkward with the bitfields, I was mainly
> concerned about the extensibility of packing logically separate
> attributes into the config members and had heard similar concerns from
> a few others who had been experimenting with my patches too.
>
> A few simple attributes I can think of a.t.m that we might want to add
> in the future are:
> - control of the OABUFFER size
> - a way to ask the kernel to collect reports at the beginning and end
> of batch buffers, in addition to periodic reports
> - alternative ways to uniquely identify a context to support tools
> profiling a single context not necessarily owned by the current
> process
>
> It could also be interesting to expose some counter configuration
> through these attributes too. E.g. on Broadwell+ we have 14 'Flexible
> EU' counters included in the OA unit reports, each with a 16bit
> configuration.
>
> In a more extreme case it might also be useful to allow userspace to
> specify a complete counter config, which (depending on the
> configuration) could be over 100 32bit values to select the counter
> signals + configure the corresponding combining logic.
>
> Since this pmu is in a device driver it also seemed reasonably
> appropriate to de-couple it slightly from the core perf_event_attr
> structure by allowing driver extensible attributes.
>
> I wonder if it might be less worrisome if the i915_oa_copy_attr() code
> were instead a re-usable utility perhaps maintained in events/core.c,
> so if other pmu drivers were to follow suite there would be less risk
> of a mistake being made here?
So I had a peek at:
https://01.org/sites/default/files/documentation/observability_performance_counters_haswell.pdf
In an attempt to inform myself of how the hardware worked. But the
document is rather sparse (and does not include broadwell).
So from what I can gather there's two ways to observe the counters,
through MMIO or trough the ring-buffer, which in turn seems triggered by
a MI_REPORT_PERF_COUNT command.
[ Now the document talks about shortcomings of this scheme, where the
MI_REPORT_PERF_COUNT command can only be placed every other command, but
a single command can contain so much computation that this is not fine
grained enough -- leading to the suggestion of a timer/cycle based
reporting, but that is not actually mentioned afaict ]
Now the MI_REPORT_PERF_COUNT can select a vector (Counter Select) of
which events it will write out.
This covers the regular 'A' counters. Is this correct?
Then there are the 'B' counters, which appear to be programmable through
the CEC MMIO registers.
These B events can also be part of the MI_REPORT_PERF_COUNT vector.
Right?
So for me the 'natural' way to represent this in perf would be through
event groups. Create a perf event for every single event -- yes this is
53 events.
Use the MMIO reads for the regular read() interface, and use a hrtimer
placing MI_REPORT_PERF_COUNT commands, with a counter select mask
covering the all events in the current group, for sampling.
Then obtain the vector values using PERF_SAMPLE_READ and
PERF_FORMAT_READ -- and use the read txn support to consume the
ring-buffer vectors instead of the MMIO registers.
You can use the perf_event_attr::config to select the counter (A0-A44,
B0-B7) and use perf_event_attr::config1 (low and high dword) for the
corresponding CEC registers.
This does not require random per driver ABI extentions for
perf_event_attr, nor your custom output format.
Am I missing something obvious here?
More information about the Intel-gfx
mailing list