[Intel-gfx] [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Robert Bragg robert at sixbynine.org
Wed May 4 11:15:22 UTC 2016


On Wed, May 4, 2016 at 10:04 AM, Martin Peres <martin.peres at linux.intel.com>
wrote:

> On 03/05/16 22:34, Robert Bragg wrote:
>
>> Sorry for the delay replying to this, I missed it.
>>
>
> No worries!
>
>
>> On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <martin.peres at free.fr
>> <mailto:martin.peres at free.fr>> wrote:
>>
>>     On 20/04/16 17:23, Robert Bragg wrote:
>>
>>         Gen graphics hardware can be set up to periodically write
>>         snapshots of
>>         performance counters into a circular buffer via its Observation
>>         Architecture and this patch exposes that capability to userspace
>>         via the
>>         i915 perf interface.
>>
>>         Cc: Chris Wilson <chris at chris-wilson.co.uk
>>         <mailto:chris at chris-wilson.co.uk>>
>>         Signed-off-by: Robert Bragg <robert at sixbynine.org
>>         <mailto:robert at sixbynine.org>>
>>         Signed-off-by: Zhenyu Wang <zhenyuw at linux.intel.com
>>         <mailto:zhenyuw at linux.intel.com>>
>>
>>         ---
>>           drivers/gpu/drm/i915/i915_drv.h         |  56 +-
>>           drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
>>           drivers/gpu/drm/i915/i915_perf.c        | 940
>>         +++++++++++++++++++++++++++++++-
>>           drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
>>           include/uapi/drm/i915_drm.h             |  70 ++-
>>           5 files changed, 1408 insertions(+), 20 deletions(-)
>>
>>         +
>>         +
>>         +       /* It takes a fairly long time for a new MUX
>>         +        * configuration to be applied after these register
>>         +        * writes. This delay duration was derived empirically
>>         +        * based on the render_basic config but hopefully it
>>         +        * covers the maximum configuration latency...
>>         +        */
>>         +       mdelay(100);
>>
>>
>>     With such a HW and SW design, how can we ever hope to get any
>>     kind of performance when we are trying to monitor different
>>     metrics on each draw call? This may be acceptable for system
>>     monitoring, but it is problematic for the GL extensions :s
>>
>>
>>     Since it seems like we are going for a perf API, it means that for
>>     every change
>>     of metrics, we need to flush the commands, wait for the GPU to be
>>     done, then
>>     program the new set of metrics via an IOCTL, wait 100 ms, and then
>>     we may
>>     resume rendering ... until the next change. We are talking about a
>>     latency of 6-7 frames at 60 Hz here... this is non-negligible...
>>
>>
>>     I understand that we have a ton of counters and we may hide the
>>     latency by not allowing more than half of the counters to be used
>>     for every draw call or frame, but even then, this 100 ms delay is
>>     killing this approach altogether.
>>
>>
>>
>> Although I'm also really unhappy about introducing this delay recently,
>> the impact of the delay is typically amortized somewhat by keeping a
>> configuration open as long as possible.
>>
>> Even without this explicit delay here the OA unit isn't suited to being
>> reconfigured on a per draw call basis, though it is able to support per
>> draw call queries with the same config.
>>
>> The above assessment assumes wanting to change config between draw calls
>> which is not something this driver aims to support - as the HW isn't
>> really designed for that model.
>>
>> E.g. in the case of INTEL_performance_query, the backend keeps the i915
>> perf stream open until all OA based query objects are deleted - so you
>> have to be pretty explicit if you want to change config.
>>
>
> OK, I get your point. However, I still want to state that applications
> changing the set would see a disastrous effect, as 100 ms is enough to
> downclock both the CPU and GPU, and that would dramatically alter the
> metrics. Should we make it clear somewhere, either in
> INTEL_performance_query or as a warning in mesa_performance when
> changing the set while running? I would think the latter would be
> preferable, as it could also cover the case of the AMD extension which,
> IIRC, does not talk about the performance cost of changing the metrics.
> With this caveat made clear, it seems reasonable.
>

Yeah, a KHR_debug performance warning sounds like a good idea.


>
>
>> In case you aren't familiar with how the GL_INTEL_performance_query side
>> of things works for OA counters; one thing to be aware of is that
>> there's a separate MI_REPORT_PERF_COUNT command that Mesa writes either
>> side of a query which writes all the counters for the current OA config
>> (as configured via this i915 perf interface) to a buffer. In addition to
>> collecting reports via MI_REPORT_PERF_COUNT Mesa also configures the
>> unit for periodic sampling to be able to account for potential counter
>> overflow.
>>
>
> Oh, the overflow case is mean. Doesn't the spec mandate that the
> application read at least once every second? This is the case for the
> timestamp queries.
>

For a Haswell GT3 system with 40 EUs @ 1 GHz, some aggregate EU counters
may overflow their 32 bits in approximately 40 milliseconds. It should be
pretty unusual to see a draw call last that long, but not unimaginable.
Might also be a good draw call to focus on profiling too :-)

For Gen8+ a bunch of the A counters can be reported with 40 bits to
mitigate this issue.


>
>
>>
>> It also might be worth keeping in mind that per draw queries will anyway
>> trash the pipelining of work, since it's necessary to put stalls between
>> the draw calls to avoid conflated metrics (not to do with the details of
>> this driver) so use cases will probably be limited to those that just
>> want the draw call numbers but don't mind ruining overall
>> frame/application performance. Periodic sampling or swap-to-swap queries
>> would be better suited to cases that should minimize their impact.
>>
>
> Yes, I agree that there will always be a cost, but with the design
> implemented in nouveau (which barely involves the CPU at all), the
> pipelining is almost unaffected. As in, monitoring every draw call with a
> different metric would lower the performance of glxgears (worst case I
> could think of) but still keep thousands of FPS.
>

I guess it just has different trade-offs.

While we have a typically higher cost to reconfigure the OA unit (at
least if touching the MUX), once the config is fixed (which can be done
before measuring anything) the pipelining for queries might be slightly
better with MI_REPORT_PERF_COUNT commands than with something requiring
interrupting + executing work on the CPU to switch config (even if that
is cheaper than an OA re-config). I guess nouveau would have the same
need to insert GPU pipeline stalls (just the GPU syncing with itself) to
avoid conflating neighbouring draw call metrics, and maybe the bubbles
from those stalls can swallow the latency of the software methods.

glxgears might not really exaggerate draw call pipeline stall issues with
only 6 cheap primitives per gear. glxgears hammers context switching more
so than drawing anything. I think a pessimal case would be an app that
depends on large numbers of draw calls per frame that each do enough real
work that stalling for their completion is also measurable.

Funnily enough, enabling the OA unit with glxgears can be kind of
problematic for Gen8+, which automatically writes reports on context
switch, due to the spam of generating all of those reports.


>
>> The driver is already usable with gputop with this delay and considering
>> how config changes are typically associated with user interaction I
>> wouldn't see this as a show stopper - even though it's not ideal. I
>> think the assertions about it being unusable with GL were a little
>> overstated based on making frequent OA config changes which is not
>> really how the interface is intended to be used.
>>
>
> Yeah, with a performance warning in Mesa, I would be OK with this change.
> Thanks for taking the time to explain!


A performance warning sounds like a sensible idea, yup.

Regards,
- Robert


>
>
>>
>> Thanks for starting to take a look through the code.
>>
>> Kind Regards,
>> - Robert
>>
>
> Martin
>

