[Mesa-dev] [RFC 0/6] i965: INTEL_performance_query re-work

Wed May 6 07:43:59 PDT 2015

On 06/05/15 17:02, Daniel Vetter wrote:
> On Wed, May 06, 2015 at 10:36:15AM +0200, Samuel Pitoiset wrote:
>>
>> On 05/06/2015 02:53 AM, Robert Bragg wrote:
>>> As we've learned more about the observability capabilities of Gen
>>> graphics we've found that it's not enough to only try and configure the
>>> OA unit from userspace without any dedicated support from the kernel.
>> Hi Robert,
>>
>> Yeah, this is the same idea for performance counters on Nouveau.
>>
>> We also need to implement dedicated support in the kernel for
>> configuring/sampling hardware performance counters. Then, we can
>> expose a list of available counters through a set of ioctls. Thus, mesa
>> configures a hardware event by sending its "configuration" to the kernel.
> Well the big difference here is that i915 won't have its own set of
> ioctls, but instead reuses the perfectly capable perf subsystem. Rob has
> done some minor additions (which are acked by perf maintainers already) to
> make that feasible.
>
> Imo that's the way to go instead of reinventing a new abi wheel in each
> driver.
>
>>> As it is currently the i965 backends for both AMD_performance_monitor
>>> and INTEL_performance_query aren't able to report normalized metrics
>>> useful to application developers due to the limitations of configuring
>>> the OA unit from userspace via LRIs.
>>>
>>> More recently we've developed a perf PMU (performance monitoring unit)
>>> driver within the drm i915 driver ("i915_oa") that lets userspace
>>> configure and open an event fd via the perf_event_open syscall which
>>> provides us a more complete interface for configuring the Gen graphics
>>> OA unit.
>>>
>>> With help from the kernel we can support periodic sampling (where the
>>> hardware writes reports into a gpu mapped circular buffer that we can
>>> forward as perf samples), we can deal with the clock gating + PM
>>> limitations imposed by the observability hw and also manage + maintain
>>> the selection of performance counters.
>>>
>>> The perf_event_open(2) man page is a good starting point for anyone
>>> wanting to learn about the Linux perf interface. Something to beware of
>>> is that there's currently no precedent upstream for exposing device
>>> metrics via a perf PMU and although early feedback was sought for this
>>> work, some of this may be subject to change based on feedback from the
>>> core perf maintainers as well as the i915 drm driver maintainers.
>> Performance counters on Nouveau won't be exposed (in the near future)
>> by perf since they need to be tied to the command stream of the GPU,
>> and perf only works with ioctl calls.
> Well same for i915, except that there's the timer-based one on top. I
> still think using perf to configure the pmu and then submitting the in-cs
> sample points normally from mesa is the way to go, even if your hw lacks a
> timer-based sample functionality.

This is not really acceptable for nouveau because, unlike Intel GPUs,
only a very limited number of counters can be sampled/configured at the
same time (4 per clock domain on some models). It is thus very important
to support fast round-robin between them, as mandated by the NVIDIA
PerfKit SDK. Using the perf interface to set up the counters and then
sampling them from software methods in the command stream will lead to
every odd frame not being able to sample any data, unless we allow for
some major stalling of the pipeline, which would make the whole
profiling painfully slow.
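
To make the slot constraint concrete, here is a rough sketch of the
round-robin arithmetic. The limit of 4 counters per clock domain is the
figure given above for some models; the function names and the simple
"consecutive groups of 4" policy are made up for illustration and are
not what nouveau or the NVIDIA SDK actually does.

```c
#include <assert.h>
#include <stddef.h>

/* "4 per clock domain on some models"; other chipsets may differ. */
#define SLOTS_PER_DOMAIN 4

/* Number of passes needed to sample num_events events from one domain. */
size_t passes_needed(size_t num_events)
{
    return (num_events + SLOTS_PER_DOMAIN - 1) / SLOTS_PER_DOMAIN;
}

/*
 * Fill out[] with the indices of the events sampled during a given pass
 * and return how many were written. Passes wrap around, so pass numbers
 * can simply keep increasing (e.g. one pass per draw call).
 */
size_t events_in_pass(size_t num_events, size_t pass,
                      size_t out[SLOTS_PER_DOMAIN])
{
    size_t passes = passes_needed(num_events);
    size_t first, i, n = 0;

    assert(out != NULL);
    if (passes == 0)
        return 0;

    first = (pass % passes) * SLOTS_PER_DOMAIN;
    for (i = first; i < num_events && n < SLOTS_PER_DOMAIN; i++)
        out[n++] = i;
    return n;
}
```

With 10 events in a single domain this gives 3 passes, so the counter
configuration has to change often, and each change needs to be cheap.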

The solution nouveau chose was to have an ioctl that creates a handle
hiding all the configuration needed for monitoring a particular event,
and then to reference this handle in the command stream whenever this
event needs to be configured/monitored. The kernel will thus program the
counters as late as possible. This allows nouveau to always be able
to use all the counters at the same time and still change the
configuration after each draw call without stalling the pipeline as
much. Perf's interface is not suited to do that, IIRC.
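
As a toy model of the handle idea (plain userspace C, not the real
nouveau or nvif code): one call stands in for the ioctl that stores a
configuration and returns a handle, and another stands in for the kernel
meeting a handle reference in the command stream. All struct and
function names are hypothetical, and skipping the reprogramming when the
handle is already active is my own assumption about what "as late as
possible" could allow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLES      16
#define SLOTS_PER_DOMAIN 4   /* "4 per clock domain on some models" */

/* Hypothetical: everything needed to monitor one event. */
struct perf_config {
    uint32_t signal[SLOTS_PER_DOMAIN];   /* made-up per-slot selectors */
};

struct perf_state {
    struct perf_config configs[MAX_HANDLES];
    size_t   num_handles;
    int      active_handle;     /* handle currently programmed, -1 if none */
    unsigned reprogram_count;   /* how often the counters were reprogrammed */
};

void perf_state_init(struct perf_state *s)
{
    assert(s != NULL);
    s->num_handles = 0;
    s->active_handle = -1;
    s->reprogram_count = 0;
}

/* Stands in for the ioctl: store the configuration, return a handle. */
int perf_create_handle(struct perf_state *s, const struct perf_config *cfg)
{
    assert(s != NULL && cfg != NULL);
    if (s->num_handles == MAX_HANDLES)
        return -1;
    s->configs[s->num_handles] = *cfg;
    return (int)s->num_handles++;
}

/*
 * Stands in for the kernel reaching a handle reference in the command
 * stream. The counters are only touched here, at the point of use, and
 * only if a different configuration is currently active.
 */
int perf_exec_handle_ref(struct perf_state *s, int handle)
{
    assert(s != NULL);
    if (handle < 0 || (size_t)handle >= s->num_handles)
        return -1;
    if (s->active_handle != handle) {
        /* A real driver would write s->configs[handle] to the hw here. */
        s->active_handle = handle;
        s->reprogram_count++;
    }
    return 0;
}
```

The point is that userspace only ever passes small handles around, while
the decision of when to actually reprogram the counters stays with the
kernel.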

Now speaking about the ioctl, nouveau will not need to introduce another
ioctl, as there is an interface called nvif that is meant to expose the
internal objects of nouveau to userspace through a single ioctl (or
to a VM through hypercalls). The userspace code for nvif has been in
review for quite a long time. I hope it will be reviewed more publicly
soon! I may need to step up and have a look at it though.

At least on some chipsets, it is possible to have timer-based polling,
but the frequency is not configurable.

I hope this makes our plans more understandable.