[Intel-gfx] [RFC 0/8] Introduce framework to forward asynchronous OA counter

sourab.gupta at intel.com
Mon Jun 22 02:50:11 PDT 2015


From: Sourab Gupta <sourab.gupta at intel.com>

Cc: Robert Bragg <robert at sixbynine.org>,
    Zhenyu Wang <zhenyuw at linux.intel.com>,
    Jon Bloomfield <jon.bloomfield at intel.com>,
    Peter Zijlstra <a.p.zijlstra at chello.nl>,
    Jabin Wu <jabin.wu at intel.com>,
    Insoo Woo <insoo.woo at intel.com>

This patch series adds support for capturing OA counter snapshots at
asynchronous points, by inserting MI_REPORT_PERF_COUNT commands into the
CS and forwarding the resulting snapshots to userspace through the perf
interface. These commands can be inserted at arbitrary points during
workload execution, e.g. at batch buffer boundaries.
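As a rough illustration, the kernel-side insertion could look like the
sketch below (assuming the pre-4.x intel_ring_* emission helpers; the
MI_REPORT_PERF_COUNT define and the address bits are illustrative, per
the HSW PRM, and not the exact code in these patches):

/*
 * Illustrative sketch: emit MI_REPORT_PERF_COUNT so the OA unit dumps
 * a counter snapshot to a GTT-mapped buffer. Opcode 0x28, length 1
 * (3 dwords total) per the HSW PRM.
 */
#include "i915_drv.h"

#define MI_REPORT_PERF_COUNT_SKETCH ((0x28 << 23) | 1)

static int emit_oa_snapshot(struct intel_engine_cs *ring,
                            u32 buf_gtt_offset, u32 report_id)
{
        int ret = intel_ring_begin(ring, 4);

        if (ret)
                return ret;

        intel_ring_emit(ring, MI_REPORT_PERF_COUNT_SKETCH);
        /* destination for the snapshot; low bit selects global GTT */
        intel_ring_emit(ring, buf_gtt_offset | 1);
        /* report id, used to correlate the snapshot on the read side */
        intel_ring_emit(ring, report_id);
        intel_ring_emit(ring, MI_NOOP);
        intel_ring_advance(ring);

        return 0;
}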

This work is based on Robert Bragg's perf event framework, the patches for
which were floated earlier. Please see the link below:

http://lists.freedesktop.org/archives/intel-gfx/2015-May/066102.html

The perf event framework enables capture of periodic OA counter
snapshots by configuring the OA unit during perf event init. The raw OA
reports generated by the HW are then forwarded to userspace using the
perf APIs.
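For context, opening the periodic OA event from userspace looks roughly
like the sketch below (assuming the dynamic PMU from Robert's series is
registered as "i915_oa" under /sys/bus/event_source; the attr.config
encoding of metrics set, OA format and period exponent is elided):

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        uint32_t type;
        FILE *f = fopen("/sys/bus/event_source/devices/i915_oa/type", "r");

        if (!f || fscanf(f, "%u", &type) != 1)
                return 1;
        fclose(f);

        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;                   /* dynamic PMU type from sysfs */
        attr.sample_type = PERF_SAMPLE_RAW; /* raw OA reports */
        attr.sample_period = 1;             /* one sample per OA report */
        /* attr.config would select metrics set, OA format and exponent */

        int fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
        if (fd < 0)
                return 1;

        /* 1 metadata page + 2^n data pages for the sample ring */
        void *base = mmap(NULL, (1 + 16) * 4096, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
                return 1;

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... consume PERF_RECORD_SAMPLE records from the mmap ring ... */
        return 0;
}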

There are use cases which need more than the periodic OA capture
functionality currently supported by perf_event. A few such use cases
are:
    - Ability to capture system-wide metrics, where the captured
      reports can be mapped back to individual contexts.
    - Ability to inject tags for work into the reports, providing
      visibility into the multiple stages of work within a single
      context.

This framework may also be seen as a way to overcome a limitation of
Haswell, which doesn't write out a context ID with its OA reports.
Handling this in the kernel also makes sense as we plan for
compatibility with Broadwell.

This can be achieved by inserting commands into the ring to dump the
OA counter snapshots at asynchronous points during workload execution.
The reports generated can have an additional footer appended, capturing
metadata such as ctx id, pid, tags, etc. The specific issue of counter
wraparound across large batchbuffers can be mitigated by using these
reports in conjunction with the periodic OA snapshots. Such per-BB data
gives userspace tools useful information for analyzing performance and
timing at the batchbuffer level.
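A hypothetical layout for such a raw sample (field names and sizes are
illustrative, not the exact uapi proposed by these patches):

#include <stdint.h>

#define OA_REPORT_SIZE 256      /* HSW OA report size, format-dependent */

struct oa_async_footer {
        uint32_t ctx_id;        /* globally unique context id */
        uint32_t pid;           /* submitting process */
        uint32_t tag;           /* perfTag passed via execbuffer */
        uint32_t pad;
};

struct oa_async_sample {
        uint8_t report[OA_REPORT_SIZE];  /* raw OA counter snapshot */
        struct oa_async_footer footer;   /* appended metadata */
};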

An application intending to profile its own contexts can do so by
submitting the MI_REPORT_PERF_COUNT commands to the CS from userspace
itself. But consider the use case of a system-wide GPU profiler tool,
which needs this data for all the workloads scheduled on the GPU
globally. The relative complexity of doing this in the kernel is
significantly less than that of supporting such a use case from
userspace.

This framework is intended to serve such system-wide GPU profilers,
which may further use this data for purposes such as performance
analysis (at a global level), identifying opportunities to improve GPU
utilization, CPU vs GPU timing analysis, etc. Again, this is made
possible by the metadata attached to individual reports, which this
framework enables.

One such system-wide GPU profiler is the MVP (Modular Video Profiler)
tool, used by the media team for profiling media workloads. (Talks are
in progress for open sourcing this tool.)

The current implementation forwards these samples through the same
PERF_SAMPLE_RAW sample type as the periodic samples, with an additional
footer appended for the metadata. Userspace can then distinguish these
samples by filtering on the sample size.
Another approach being contemplated is to create separate sample types
to handle these different kinds of samples. There would be different
fds associated with the different sample types, though they could be
part of one event group, and userspace could listen to either or both
sample types by specifying event attributes during event init.
Right now, though, I see this as a future refinement, subject to
acceptance of the general framework. For now I'm looking for feedback
on these initial patches, w.r.t. the usage of the perf APIs and the
interaction with i915.
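With the size-based scheme, a consumer walking the perf mmap ring could
distinguish the two kinds of samples roughly as follows (a sketch; the
sizes are placeholders):

#include <linux/perf_event.h>
#include <stdint.h>

#define OA_REPORT_SIZE  256     /* placeholder: periodic report size */
#define OA_FOOTER_SIZE  16      /* placeholder: metadata footer size */

static void handle_sample(const struct perf_event_header *hdr)
{
        const uint32_t *raw_size;
        const uint8_t *payload;

        if (hdr->type != PERF_RECORD_SAMPLE)
                return;

        /* PERF_SAMPLE_RAW layout: u32 size followed by the raw bytes */
        raw_size = (const uint32_t *)(hdr + 1);
        payload = (const uint8_t *)(raw_size + 1);

        if (*raw_size >= OA_REPORT_SIZE + OA_FOOTER_SIZE) {
                /* async snapshot: OA report + footer (ctx id, pid, tag) */
        } else {
                /* periodic snapshot: bare OA report */
        }
        (void)payload;          /* would parse the report/footer here */
}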

Another feature introduced in these patches is perfTag. PerfTag is a
mechanism whereby the collected reports are marked with a tag passed by
userspace during the execbuffer call. The userspace tool can thereby
associate the collected reports with the corresponding execbuffers.
This satisfies the requirement of having visibility into the multiple
stages (i.e. execbuffers) within a single context.
For example, in the media pipeline the CodecHAL encoding stage has a
single context, but involves multiple stages such as Scaling, ME, MBEnc
and PAK, each submitted with a separate execbuffer call. The reports
generated need the granularity of these individual stages of a context,
and the presence of a perfTag in the report metadata fulfills this
requirement. This is currently done by using the rsvd2 field of the
execbuffer ioctl structure, with an additional bitfield in flags to
inform the KMD of its presence.
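From userspace, tagging a submission might look like the sketch below
(I915_EXEC_PERFTAG is a hypothetical flag name; the series only states
that rsvd2 carries the tag and a new flags bit announces it):

#include <drm/i915_drm.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#define I915_EXEC_PERFTAG (1 << 15)     /* hypothetical flag bit */

static int submit_tagged(int drm_fd,
                         struct drm_i915_gem_exec_object2 *objs,
                         unsigned int nobj, uint32_t batch_len,
                         uint32_t perftag)
{
        struct drm_i915_gem_execbuffer2 execbuf;

        memset(&execbuf, 0, sizeof(execbuf));
        execbuf.buffers_ptr = (uintptr_t)objs;
        execbuf.buffer_count = nobj;
        execbuf.batch_len = batch_len;
        execbuf.flags = I915_EXEC_RENDER | I915_EXEC_PERFTAG;
        execbuf.rsvd2 = perftag;        /* tag shows up in report footer */

        return ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
}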


One of the prerequisites for this work is the presence of a globally
unique context id. The context id right now is specific to a drm file
instance, so by itself it can't be used to globally associate the
generated reports with the contexts scheduled from userspace. In the
absence of a globally unique context id, other metadata such as
pid/tags, in conjunction with the ctx id, may be used to associate
reports with their corresponding contexts.
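For illustration only, one plausible shape for a driver-global id
allocator is sketched below (not necessarily how the first patch
implements it):

#include <linux/gfp.h>
#include <linux/idr.h>
#include <linux/types.h>

struct ctx_stub {
        u32 global_id;
};

static DEFINE_IDA(global_ctx_ida);

/* allocate ids from a driver-global pool, unique across drm files */
static int assign_global_ctx_id(struct ctx_stub *ctx)
{
        int id = ida_simple_get(&global_ctx_ida, 0, 0, GFP_KERNEL);

        if (id < 0)
                return id;
        ctx->global_id = id;
        return 0;
}

static void release_global_ctx_id(struct ctx_stub *ctx)
{
        ida_simple_remove(&global_ctx_ida, ctx->global_id);
}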

The first patch in the series proposes a way of implementing globally
unique context ids. I'm looking for comments on the pros & cons of
having a global ctx id; the implementation can be refined if the
approach is acceptable.
The subsequent patches introduce the asynchronous OA capture mode and the
mechanism to forward these snapshots using perf.

This patch set currently supports Haswell. Gen8+ support can be added when
the basic framework is agreed upon.

Sourab Gupta (8):
  drm/i915: Have globally unique context ids, as opposed to drm file
    specific
  drm/i915: Introduce mode for asynchronous capture of OA counters
  drm/i915: Add the data structures for async OA capture mode
  drm/i915: Add mechanism for forwarding async OA counter snapshots
    through perf
  drm/i915: Wait for GPU to finish before event stop, in async OA
    counter mode
  drm/i915: Routines for inserting OA capture commands in the ringbuffer
  drm/i915: Add commands in ringbuf for OA snapshot capture across
    Batchbuffer boundaries
  drm/i915: Add perfTag support for OA counter reports

 drivers/gpu/drm/i915/i915_debugfs.c        |   2 +-
 drivers/gpu/drm/i915/i915_dma.c            |   1 +
 drivers/gpu/drm/i915/i915_drv.h            |  47 ++-
 drivers/gpu/drm/i915/i915_gem_context.c    |  53 +++-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |   9 +
 drivers/gpu/drm/i915/i915_oa_perf.c        | 451 +++++++++++++++++++++++++++--
 include/uapi/drm/i915_drm.h                |  24 +-
 7 files changed, 538 insertions(+), 49 deletions(-)

-- 
1.8.5.1


