[Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

Tue Oct 20 13:16:41 PDT 2015

On Fri, Oct 16, 2015 at 10:43 AM, Peter Zijlstra <peterz at infradead.org> wrote:
> On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
>> - We're bridging two complex architectures
>>
>>     To review this work I think it will be relevant to have a good
>>     general familiarity with Gen graphics (e.g. thinking about the OA
>>     unit's interaction with the command streamer and execlist
>>     scheduling) as well as our userspace architecture and how we're
>>     consuming OA data within Mesa to implement the
>>     INTEL_performance_query extension.
>>
>>     On the flip side here, it's necessary to understand the perf
>>     userspace interface (for most this is hidden by tools so the details
>>     aren't common knowledge) as well as the internal design, considering
>>     that the PMU we're looking at seems to break several current design
>>     assumptions. I can only claim a limited familiarity with perf's
>>     design, just as a result of this work.
>
> Right; but a little effort and patience on both sides should get us
> there I think. At worst we'll both learn something new ;-)

I suppose I'm also concerned that time is an important factor. When it
comes to the OA metrics, we already have userspace tools that could be
more widely used by developers once we have an upstream interface.
Today perf isn't very well suited to our OA unit use case, and
although we may be able to change that - and I can try to help with
that - at this point I think I'd prefer not to block moving forward in
the meantime with the alternative i915 interface.

Although code-wise it didn't require any big changes to events/core to
get an initial perf based driver working for our use case, we have
raised a number of quite significant design questions and arguably cut
some corners, which could take a long time to resolve properly. I also
tend to think it's an open question at this stage whether it would
really be in everyone's interest to take perf in this direction
without a clear sense of the benefits it brings in comparison to the
complexity it may add.

It's also a bit awkward that I had already started to move ahead with
this idea of upstreaming a non-perf based driver for the OA unit after
asking Daniel Vetter about this on IRC. There are some knock-on
effects here too; Sourab Gupta is looking at building on this OA
driver and has now started adapting his work for this non-perf
approach.

>
>> - The current OA PMU driver breaks some significant design assumptions.
>>
>>     Existing perf pmus are used for profiling work on a cpu and we're
>>     introducing the idea of _IS_DEVICE pmus with different security
>>     implications, the need to fake cpu-related data (such as user/kernel
>>     registers) to fit with perf's current design, and adding _DEVICE
>>     records as a way to forward device-specific status records.
>
> There are more devices with counters on than GPUs, so I think it might
> make sense to look at extending perf to better deal with this.

I wonder if it could be good to look at exposing some of the mmio
accessible Gen graphics counters before tackling a more complex case
like the OA unit. We have a number of counters that could be
interesting to sample periodically via a hrtimer, that require no
configuration, are global (so no need to specify a gpu context) but as
they relate to the GPU an _IS_DEVICE pmu would still be appropriate.
Some of these seem like they could be better suited to being exposed
via perf than OA unit counters so they might be a helpful stepping
stone.

>
>>     The OA unit writes reports of counters into a circular buffer,
>>     without involvement from the CPU, making our PMU driver the first of
>>     a kind.
>
> Agreed, this is somewhat 'odd' from where we are today.
>
>>     Perf supports groups of counters and allows those to be read via
>>     transactions internally but transactions currently seem designed to
>>     be explicitly initiated from the cpu (say in response to a userspace
>>     read()) and while we could pull a report out of the OA buffer we
>>     can't trigger a report from the cpu on demand.
>>
>>     Related to being report based; the OA counters are configured in HW
>>     as a set while perf generally expects counter configurations to be
>>     orthogonal. Although counters can be associated with a group leader
>>     as they are opened, there's no clear precedent for being able to
>>     provide group-wide configuration attributes and no obvious solution
>>     as yet that's expected to be acceptable to upstream and meets our
>>     userspace needs.
>
> I'm not entirely sure what you mean with group-wide configuration
> attributes; could you elaborate?

Here I'm thinking of configuration details that conceptually relate to
a set of OA unit counters, not individual events/counters:

- The choice of 'metric set' which represents a MUX configuration +
boolean logic configuration for a set of counters that will be
included in the reports written by the OA unit.

- The OA unit exponent for periodic sampling applies to the whole group.

- The choice of report layout which the OA unit writes all the counters in.

- The choice to profile a single context or system-wide applies to the
group, as well as the specification of a file descriptor + context ID
in the single-context case.
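
To make the list above a bit more concrete, here is a minimal sketch of
how these group-wide attributes might be bundled together. All the names
here (OAGroupConfig, metric_set, report_format, period_exponent, drm_fd,
ctx_id) are hypothetical and only for illustration; this isn't an
existing perf or i915 interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OAGroupConfig:
    """Hypothetical bundle of attributes that apply to a whole set of
    OA counters rather than to any individual event/counter."""
    metric_set: str               # selects the MUX + boolean logic configuration
    report_format: str            # layout the OA unit writes all counters in
    period_exponent: int          # OA unit exponent for periodic sampling
    drm_fd: Optional[int] = None  # given together with ctx_id for single-context
    ctx_id: Optional[int] = None  # context handle, only unique to drm_fd

    def __post_init__(self):
        if self.period_exponent < 0:
            raise ValueError("period exponent must be non-negative")
        # Single-context profiling needs both the fd and the context handle;
        # leaving both unset means a system-wide view.
        if (self.drm_fd is None) != (self.ctx_id is None):
            raise ValueError("drm_fd and ctx_id must be given together")

    @property
    def system_wide(self) -> bool:
        return self.drm_fd is None
```

The point is just that none of these fields make sense per counter: they
either apply to every counter in the report or to none of them.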

>
>>     We currently avoid using perf's grouping feature
>>     and forward OA reports to userspace via perf's 'raw' sample field.
>>     This suits our userspace well considering how coupled the counters
>>     are when dealing with normalizing. It would be inconvenient to split
>>     counters up into separate events, only to require userspace to
>>     recombine them.
>
> So IF you were using a group, a single read from the leader can return
> you a vector of all values (PERF_FORMAT_GROUP), this avoids having to
> do that recombine.

Although recombining isn't necessary with _FORMAT_GROUP, this vector
layout could be similarly inconvenient for userspace...

afaict we couldn't avoid also requesting the _ID to be included in the
vector, so instead of the 32/40 bits per counter we would now have
16 bytes per counter, which would seem to lose some benefit from a
compact HW layout to minimize our use of memory bandwidth.
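
As a rough back-of-the-envelope sketch of that size difference: a
PERF_FORMAT_GROUP | PERF_FORMAT_ID read is a u64 count followed by a
{u64 value, u64 id} pair per counter (assuming time_enabled/time_running
aren't also requested). The counter mix below is made up for
illustration, and the packed size ignores any report header or padding,
so treat the numbers as indicative only.

```python
def group_vector_bytes(n_counters: int) -> int:
    """Size of a GROUP|ID read: u64 nr, then {u64 value, u64 id} each."""
    return 8 + 16 * n_counters


def packed_counter_bytes(n_32bit: int, n_40bit: int) -> int:
    """Counter payload only, at 32 or 40 bits per counter (no header/padding)."""
    return 4 * n_32bit + 5 * n_40bit


# Hypothetical mix of counters, not a real report format.
n32, n40 = 16, 32
vector = group_vector_bytes(n32 + n40)
packed = packed_counter_bytes(n32, n40)
print(f"vector: {vector} bytes, packed: {packed} bytes")
```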

Userspace has to be aware that we're placing 32bit or 40bit values
within a 64bit field and the values will overflow as such.

As userspace receives reports it calculates deltas between sequential
reports to accumulate 64bit counters. For each event in the vector it
needs to look up the index of the accumulated value, and a lookup based
on a u64 event id for each counter isn't as direct as with a rigid
report layout. (I don't know what assumptions can be made about the
vector ordering from one sample to the next, but maybe a fixed mapping
could be made after the first sample.) When calculating the delta
userspace needs to be careful not to treat the vector values as 64bit
but as 32/40bit to account for overflow.
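
For illustration, the overflow handling amounts to doing the subtraction
modulo the counter width before accumulating into 64 bits. This is only
a sketch with hypothetical helper names, and it assumes a counter wraps
at most once between two sequential reports.

```python
MASK64 = (1 << 64) - 1


def counter_delta(prev: int, curr: int, bits: int) -> int:
    """Delta between two raw values of a `bits`-wide counter (32 or 40),
    correct across a single wrap-around."""
    return (curr - prev) & ((1 << bits) - 1)


def accumulate(acc, prev_report, curr_report, widths):
    """Add per-counter deltas to the 64bit accumulators.

    With a rigid report layout the index into each list is the same for
    every report, so no per-counter id lookup is needed.
    """
    return [
        (a + counter_delta(p, c, w)) & MASK64
        for a, p, c, w in zip(acc, prev_report, curr_report, widths)
    ]
```

Treating the same values as plain 64bit numbers would turn a wrap into
a huge bogus delta, which is the mistake to avoid.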

Mesa will still have to be able to handle the raw OA report layouts to
process reports collected via the command stream using
MI_REPORT_PERF_COUNT commands, so any alternative at least represents
some amount of extra layout handling code and needing to handle
different combinations of raw or vector layouts when calculating a
single delta.

>
> Another option would be to view the arrival of an OA vector in the
> datastream as an 'event' and generate a PERF_RECORD_READ in the perf
> buffer (which again can use the GROUP vector format).

Tbh I couldn't really figure out what PERF_RECORD_READ is intended for
- the userspace code I found currently ignores these.

The extensible sample design seems more appropriate, but it could be
good if samples were extensible by device drivers as an alternative to
packing data into the raw field. Sourab, who's been building on my base
OA driver, is exposing more data as part of samples, including a
context ID, and we were extending what we included in the raw field
(we wouldn't get this extra data via _RECORD_READ).

Conceptually the aim was to expose event specific sample flags, to
extend samples (mirroring the existing design for pre-defined sample
flags, except driver extensible). We just used the raw field as the
most convenient extension point to start with.

Extending the raw field has some difficulties though because
events/core currently only lets a pmu give a single raw data pointer
plus len which will be copied into the ring buffer, so to include more
than the OA report we'd have to copy the report into an intermediate
larger buffer. I'd been considering allowing a vector of data+len
values to be specified for copying the raw data.
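
Roughly what I have in mind, sketched in userspace terms rather than as
kernel code: each (data, len) fragment gets copied straight into the
ring buffer in turn, so nothing has to be assembled in an intermediate
buffer first. The function name and signature are hypothetical, and the
sketch ignores the consumer's tail position entirely.

```python
def copy_raw_fragments(ring: bytearray, head: int, fragments) -> int:
    """Copy a vector of (data, length) fragments into `ring` starting at
    `head`, wrapping at the end of the buffer. Returns the new head."""
    size = len(ring)
    if sum(length for _, length in fragments) > size:
        raise ValueError("fragments do not fit in the ring buffer")
    for data, length in fragments:
        if length > len(data):
            raise ValueError("fragment length exceeds its data")
        view = memoryview(data)[:length]
        first = min(length, size - head)        # bytes before the wrap point
        ring[head:head + first] = view[:first]
        ring[0:length - first] = view[first:]   # remainder, if any, wraps
        head = (head + length) % size
    return head
```

With this, the OA report and any extra per-sample data (such as a
context ID) could be passed as separate fragments.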

>
>>     Related to counter orthogonality; we can't time share the OA unit,
>>     while event scheduling is a central design idea within perf for
>>     allowing userspace to open + enable more events than can be
>>     configured in HW at any one time.
>
> So we have other PMUs that cannot do this; Gen OA would not be unique in
> this. Intel PT for example only allows a single active event.
>
> That said; earlier today I saw:
>
>   https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7
>
> where exactly this feature was mentioned as not fitting well into the
> existing GPU performance interfaces (GL_AMD_performance_monitor /
> GL_INTEL_performance_query).

I think Samuel was generally commenting that these Intel/AMD GL
extensions aren't a good fit for Nvidia hw since it's awkward to
abstract the need to update the MUX configuration in-flight so they
can derive higher level counters from more inputs than the HW can
expose at the same time. Unlike Intel and AMD, it looks like instead
of defining a GL extension for accessing performance counters, Nvidia
has a perfkit api which can be used in conjunction with OpenGL, CUDA or
Direct3D, which Samuel has also been implementing.

I'm not sure what exactly about the AMD/Intel GL extensions makes it
tricky to abstract round-robin updates of the MUXs internally vs what
perfkit offers, but since we wouldn't expect to round-robin the OA
counter configuration, the INTEL_performance_query spec authors
wouldn't have given that any consideration.

>
> So there is hardware (Nvidia) out there that does support this. Also
> mentioned was that this hardware has global and local counters, where
> the local ones are specific to a rendering context. That is not unlike
> the per-cpu / per-task stuff perf does.

For reference, OA counters can be viewed as global or local counters.
When opening an event an application can request a system-wide view vs
single-context view.

>
>>     The OA unit is not designed to
>>     allow re-configuration while in use. We can't reconfigure the OA
>>     unit without losing internal OA unit state which we can't access
>>     explicitly to save and restore. Reconfiguring the OA unit is also
>>     relatively slow, involving ~100 register writes. From userspace Mesa
>>     also depends on a stable OA configuration when emitting
>>     MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
>>     disabled while there are outstanding MI_RPC commands lest we hang
>>     the command streamer.
>
> Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.

This looks like it's incompatible with the group mechanism, re: the
other question of exposing OA reports via the grouping mechanism vs
raw sample data.

>
>> - We may be making some technical compromises a.t.m for the sake of
>>   using perf.
>>
>>     perf_event_open() requires events to either relate to a pid or a
>>     specific cpu core, while our device pmu relates to neither.  Events
>>     opened with a pid will be automatically enabled/disabled according
>>     to the scheduling of that process - so not appropriate for us.
>
> Right; the traditional cpu/pid mapping doesn't work well for devices;
> but maybe, with some work, we can create something like that
> global/local render context from it; although I've no clue what form
> that would need at this time.

Currently the way we identify a context is with the combination of a
file descriptor and a u32 context handle (only unique to that fd). The
use of a drm file descriptor here also relates to the security model
for accessing single context metrics, in that we don't require root
privileges to profile a context associated with a file descriptor that
the process has open.

Recently we've also been working with a globally unique context ID
(not upstream yet) and it could be interesting to also allow single
context profiling given a global ID, without any fd (comparable to
passing a pid to perf_event_open). This form of selection would
probably require root privileges by default.

>
>>     When
>>     an event is related to a cpu id, perf ensures pmu methods will be
>>     invoked via an inter-processor interrupt on that core. To avoid
>>     invasive changes our userspace opens OA perf events for a specific
>>     cpu.
>
> Some of that might still make sense in the sense that GPUs are subject
> to the NUMA topology of machines. I would think you would want most
> such things to be done on the node the device is attached to.
>
> Granted, this might not be a concern for Intel graphics, but it might be
> relevant for some of the discrete GPUs.

I'm not sure whether Nouveau takes this kind of idea into consideration or not.

>
>> - I'm not confident our use case benefits much from building on perf:
>>
>>     We aren't using existing perf based tooling with our PMU. Existing
>>     tools typically assume you're profiling work running on a cpu, e.g.
>>     expecting samples to be associated with instruction pointers and
>>     user/kernel registers and aiming to represent metrics in relation
>>     to application source code. We're forwarding fake register values
>>     and userspace needs to know how to decode the raw OA reports
>>     before anything can be reported to a user.
>>
>>     With the buffering done by the OA unit I don't think we currently
>>     benefit from perf's mmapped circular buffer interface. We already
>>     have a decoupled producer and consumer and since we have to copy out
>>     of the OA buffer, it would work well for us to hide that copy in
>>     a simpler read() based interface.
>>
>>
>> - Logistically it might be more practical to contain this to the
>>   graphics stack.
>>
>>     It seems fair to consider that if we can't see a very compelling
>>     benefit to building on perf, then containing this work to
>>     drivers/gpu/drm/i915 may simplify the review process as well as
>>     future maintenance and development.
>
>> Peter; I wonder if you would tend to agree too that it could make sense
>> for us to go with our own interface here?
>
> Sorry this took so long; this wanted a well considered response and
> those tend to get delayed in light of 'urgent' stuff.
>
> While I can certainly see the pain points and why you would rather not
> deal with them. I think it would make Linux a better place if we could
> manage to come up with a generic interface that would work for 'all'
> GPUs (and possibly more devices).

I'm not sure a generic multi-vendor interface for gpu metrics is
what's at stake or really needed here. The perf driver I worked on
didn't attempt to provide a generic interface: e.g. counter
normalizing is relatively complex and HW specific but makes sense to
leave to userspace. Normalizing may combine periodic OA reports (from
perf) with MI_REPORT_PERF_COUNT reports obtained via a command stream
(not involving perf). Normalizing might also be done offline or
remotely for a reduced runtime overhead.

It could be good to work more closely with Samuel on this. I did raise
the idea of using perf with him last year, but I think his priority is
still with accessing metrics synchronized with the command stream. We
were looking at a perf-like interface to help support the periodic
sampling feature of the OA unit, but I'm not aware that Nvidia HW has
the same kind of feature.

I think it's more the norm that GPU drivers don't share a lot of
kernel interfaces given the significant architectural differences
between vendors. A large proportion of a GPU driver is typically in
userspace and interface standardisation is done at the OpenGL/CL level
more so than with kernel interfaces. While perf may be adaptable to
help access some gpu/device metrics it's not in a good position to be
involved with capturing metrics via command streams in sync with
specific commands submitted in userspace so I wouldn't expect to aim
for perf to be the sole interface for gpu metrics.

Okay, sorry to still be inclined to prioritize the non-perf interface
at this stage, at least for OA unit metrics. That said, I think it may
still be practical to look at exposing other mmio accessible counters
of Gen graphics via perf, and maybe these could serve as a stepping
stone before attempting to support the OA unit via perf.

I'm still open to trying to help with adapting perf in this direction
if you feel it could be worthwhile, but would like to decouple the
effort for now.

Regards,
- Robert