[Intel-gfx] [RFC 0/6] Non perf based Gen Graphics OA unit driver

Tue Sep 29 07:39:03 PDT 2015

After some recent progress enabling the Observation Architecture unit
for Gen8+, we can hopefully paint a fairly complete picture of the
requirements for supporting the unit from Haswell to Skylake and so
I'm looking again at the challenges in upstreaming this work.

Considering this, it looked like it could be worthwhile experimenting
with a non-perf based driver for the OA unit and I'm hoping to explain
why and how it went as well as request some feedback on whether we
should aim to move forward without perf.

Besides the patches forwarded here, a branch can be found for reference
here:

  https://github.com/rib/linux - wip/rib/oa-without-perf branch

I created corresponding branches for Mesa and GPU Top to test this here
(same branch names):

  https://github.com/rib/mesa
  https://github.com/rib/gputop

Here I've only included the patches up to an initial Haswell driver,
although the wip/rib/oa-without-perf branch on github includes support
for Gen8+. Please let me know if it would be helpful to forward more.

At this point I have two drivers at feature parity; one based on perf,
one not. Technically they're very similar and the patches are split to
hopefully be quite comparable. My latest perf-based work is under
wip/rib/oa-next branches in the above repos.

So, these are the concerns I have a.t.m about upstreaming this work:

- We're bridging two complex architectures

    To review this work I think it will be relevant to have a good
    general familiarity with Gen graphics (e.g. thinking about the OA
    unit's interaction with the command streamer and execlist
    scheduling) as well as our userspace architecture and how we're
    consuming OA data within Mesa to implement the
    INTEL_performance_query extension.

    On the flip side here, its necessary to understand the perf
    userspace interface (for most this is hidden by tools so the details
    aren't common knowledge) as well as the internal design, considering
    that the PMU we're looking at seems to break several current design
    assumptions. I can only claim a limited familiarity with perf's
    design, just as a result of this work.

- Limited documentation for the OA unit:

    Not unique to the OA unit but I think having a driver that extends
    outside of the graphics stack, into the core perf infrastructure
    probably requires more comprehensive HW + graphics stack
    documentation for non drm/i915 developers.  Earlier RFC discussions
    were hampered somewhat by limited documentation.  Improved
    documentation is always desirable, but of course it can also take a
    significant amount of time and effort while some key aspects
    (notably the PRMs) aren't directly under my control.

- The current OA PMU driver breaks some significant design assumptions.

    Existing perf pmus are used for profiling work on a cpu and we're
    introducing the idea of _IS_DEVICE pmus with different security
    implications, the need to fake cpu-related data (such as user/kernel
    registers) to fit with perf's current design, and adding _DEVICE
    records as a way to forward device-specific status records.

    The OA unit writes reports of counters into a circular buffer,
    without involvement from the CPU, making our PMU driver the first of
    a kind.

    Given the way we periodically forward data from the OA buffer to
    perf's buffer, these bursts of sample writes look to perf like we're
    sampling too fast and so it throttles us.

    Perf supports groups of counters and allows those to be read via
    transactions internally but transactions currently seem designed to
    be explicitly initiated from the cpu (say in response to a userspace
    read()) and while we could pull a report out of the OA buffer we
    can't trigger a report from the cpu on demand.

    Related to being report based; the OA counters are configured in HW
    as a set while perf generally expects counter configurations to be
    orthogonal. Although counters can be associated with a group leader
    as they are opened, there's no clear precedent for being able to
    provide group-wide configuration attributes and no obvious solution
    as yet that's expected to be acceptable to upstream and meets our
    userspace needs. We currently avoid using perf's grouping feature
    and forward OA reports to userspace via perf's 'raw' sample field.
    This suits our userspace well considering how coupled the counters
    are when dealing with normalizing. It would be inconvenient to split
    counters up into separate events, only to require userspace to
    recombine them. For Mesa it's also convenient to be forwarded raw,
    periodic reports for combining with the raw reports it captures
    using MI_REPORT_PERF_COUNT commands.

    Related to counter orthogonality; we can't time share the OA unit,
    while event scheduling is a central design idea within perf for
    allowing userspace to open + enable more events than can be
    configured in HW at any one time. The OA unit is not designed to
    allow re-configuration while in use. We can't reconfigure the OA
    unit without loosing internal OA unit state which we can't access
    explicitly to save and restore. Reconfiguring the OA unit is also
    relatively slow, involving ~100 register writes. From userspace Mesa
    also depends on a stable OA configuration when emitting
    MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
    disabled while there are outstanding MI_RPC commands lest we hang
    the command streamer.

- We may be making some technical compromises a.t.m for the sake of
  using perf.

    perf_event_open() requires events to either relate to a pid or a
    specific cpu core, while our device pmu relates to neither.  Events
    opened with a pid will be automatically enabled/disabled according
    to the scheduling of that process - so not appropriate for us. When
    an event is related to a cpu id, perf ensures pmu methods will be
    invoked via an inter process interrupt on that core. To avoid
    invasive changes our userspace opens OA perf events for a specific
    cpu. This is workable but it means the majority of the OA driver now
    runs in atomic context, including all OA report forwarding, which
    isn't really necessary in our case and seems to make our locking
    requirements somewhat complex as we handle the interaction with the
    rest of the i915 driver.

- I'm not confident our use case benefits much from building on perf:

    We aren't using existing perf based tooling with our PMU. Existing
    tools typically assume you're profiling work running on a cpu, e.g.
    expecting samples to be associated with instruction pointers and
    user/kernel registers and aiming to represent metrics in relation
    to application source code. We're forwarding fake register values
    and userspace needs needs to know how to decode the raw OA reports
    before anything can be reported to a user.

    With the buffering done by the OA unit I don't think we currently
    benefit from perf's mmapped circular buffer interface. We already
    have a decoupled producer and consumer and since we have to copy out
    of the OA buffer, it would work well for us to hide that copy in
    a simpler read() based interface.

- Logistically it might be more practical to contain this to the
  graphics stack.

    It seems fair to consider that if we can't see a very compelling
    benefit to building on perf, then containing this work to
    drivers/gpu/drm/i915 may simplify the review process as well as
    future maintenance and development.

About the initial non-perf driver:

Structurally it's very similar to the perf based implementation. The
userspace interface is inspired by perf and adds a
DRM_IOCTL_I915_PERF_OPEN ioctl that's conceptually comparable to
perf_event_open() returning an fd that userspace can poll() and read()
samples from. The fds also supports I915_PERF_IOCTL_ENABLE/DISABLE
ioctls much like perf.

When opening an event; users specify an event type (enum based a.t.m
like perf's built in event types) and a pointer to an event-specific
attributes structure which is extensible in the same way as struct
i915_perf_event_attr. Metrics can optionally be opened for a specific
GPU context (comparable to passing a pid to perf_event_open()) and the
contents of samples can be controlled via a sample_flags member as with
perf.

Userspace collects samples via read() which writes (only complete)
records to the user's given buffer. Records have a type + size
header equivalent to struct i915_perf_event_header.

Updating Mesa and GPU Top to experiment with this was straightforward
given the similarity to the perf interface.  The main difference is that
it only supports forwarding metrics via read()s instead of an mmaped
circular buffer. As mentioned above, I think that suits this well, and
requires no additional copying of data. I think the userspace code has
ended up being a little simpler too.

Overall the driver currently isn't much more code than with perf (~200
lines).

Personally my gut feeling a.t.m, is that we should aim to move forward
independent from perf.

I'd really appreciate some feedback from others on this though.

Daniel and Chris; although I think it made sense at the outset to try
and use perf, in light of the above would you be open to a non-perf
based driver for the OA unit?

Peter; I wonder if you would tend to agree too that it could make sense
for us to go with our own interface here?

Kind Regards,
Robert

Robert Bragg (6):
  drm/i915: Add i915 perf infrastructure
  drm/i915: rename OACONTROL GEN7_OACONTROL
  drm/i915: Add static '3D' Haswell OA unit config
  drm/i915: Add i915 perf event for Haswell OA unit
  drm/i915: Add dev.i915.perf_event_paranoid sysctl option
  drm/i915: add oa_event_min_timer_exponent sysctl

 drivers/gpu/drm/i915/Makefile           |    4 +
 drivers/gpu/drm/i915/i915_cmd_parser.c  |    4 +-
 drivers/gpu/drm/i915/i915_dma.c         |    7 +
 drivers/gpu/drm/i915/i915_drv.h         |  136 ++++
 drivers/gpu/drm/i915/i915_gem_context.c |   23 +-
 drivers/gpu/drm/i915/i915_oa_hsw.c      |   98 +++
 drivers/gpu/drm/i915/i915_oa_hsw.h      |   36 +
 drivers/gpu/drm/i915/i915_perf.c        | 1201 +++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h         |  340 ++++++++-
 include/uapi/drm/i915_drm.h             |  125 ++++
 10 files changed, 1967 insertions(+), 7 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.c
 create mode 100644 drivers/gpu/drm/i915/i915_oa_hsw.h
 create mode 100644 drivers/gpu/drm/i915/i915_perf.c

-- 
2.5.2