[PATCH 0/3] drm/panfrost: Expose HW counters to userspace

Boris Brezillon boris.brezillon at collabora.com
Sun May 12 13:38:03 UTC 2019


On Sat, 11 May 2019 15:32:20 -0700
Alyssa Rosenzweig <alyssa at rosenzweig.io> wrote:

> Hi all,
> 
> As Steven Price explained, the "GPU top" kbase approach is often more
> useful and accurate than per-draw timing. 
> 
> For a 3D game inside a GPU-accelerated desktop, the game's counters
> *should* include desktop overhead. This external overhead does affect
> the game's performance, especially if the contexts are competing for
> resources like memory bandwidth. An isolated sample is easy to achieve
> running only the app of interest; in ideal conditions (zero-copy
> fullscreen), desktop interference is negligible. 
> 
> For driver developers, the system-wide measurements are preferable,
> painting a complete system picture and avoiding disruptions. There is no
> risk of confusion, as the driver developers understand how the counters
> are exposed. Further, benchmarks rendering directly to a GBM surface are
> available (glmark2-es2-drm), eliminating interference even with poor
> desktop performance.
> 
> For app developers, the confusion of multi-context interference is
> unfortunate. Nevertheless, if enabling counters were to slow down an
> app, the confusion could be worse. Consider second-order changes in the
> app's performance characteristics due to slowdown: if techniques like
> dynamic resolution scaling are employed, the counters' results can be
> invalid.  Likewise, even if the lower-performance counters are
> appropriate for purely graphical workloads, complex apps with variable
> CPU overhead (e.g. from an FPS-dependent physics engine) can further
> confound counters. Low-overhead system-wide measurements mitigate these
> concerns.

I'd just like to point out that dumping counters the way
mali_kbase/gator does likely has an impact on performance (at least on
some GPUs) because of the cache clean+invalidate that happens before
(or after, I don't remember) each dump. IIUC, and if this cache is
actually global rather than a per-address-space thing (which would be
possible if the cache lines contained a tag attaching them to a
specific address space), then all jobs running when the
clean+invalidate happens will see extra cache misses after each dump.
Of course, that's not as invasive as the full serialization that
happens with my solution, but it's not free either.
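
For the record, here is roughly what such a dump sequence looks like.
gpu_write() and the GPU_CMD_* command values are, to my understanding,
the ones from the mainline Panfrost register definitions; whether
kbase issues the clean+invalidate before or after the sample is an
assumption on my part:

static void perfcnt_sample(struct panfrost_device *pfdev)
{
        /* Sample all enabled counters into the dump buffer. */
        gpu_write(pfdev, GPU_CMD, GPU_CMD_PERFCNT_SAMPLE);

        /* ... wait for the sample-completed IRQ here ... */

        /*
         * Clean+invalidate the GPU caches so the CPU reads coherent
         * values. If the cache is global, every job running at this
         * point pays for it with extra cache misses afterwards.
         */
        gpu_write(pfdev, GPU_CMD, GPU_CMD_CLEAN_INV_CACHES);
}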

> 
> As Rob Clark suggested, system-wide counters could be exposed via a
> semi-standardized interface, perhaps within debugfs/sysfs. The interface
> could not be completely standard, as the list of counters exposed varies
> substantially by vendor and model. Nevertheless, the mechanics of
> discovering, enabling, reading, and disabling counters can be
> standardized, as can a small set of universally meaningful counters like
> total GPU utilization. This would permit a vendor-independent GPU top
> app as suggested, as I believe is currently possible with
> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali).
> 
> It looks like this discussion is dormant. Could we try to get this
> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
> to diagnose without access to the counters, so I'm eager for a mainline
> solution to be implemented.

I spent a bit of time thinking about it and looking at different
solutions.

debugfs/sysfs might not be the best solution, especially if we think
about the multi-user case (several instances of a GPU perfmon tool
running in parallel). If we want it to work properly, we need a way to
instantiate several perf monitors and let the driver add values to all
active perfmons every time a dump happens (no matter who triggered the
dump). That's exactly what mali_kbase/gator does, BTW. That's
achievable through debugfs if we expose a knob to instantiate such
perfmon instances, but it also means risking perfmon leaks if the user
does not take care of destroying the perfmon it created when it's done
with it (or when it crashes). It might also prove hard to expose that
to non-root users in a secure way.
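
One way to avoid the leak problem is to tie each perfmon instance to a
file descriptor rather than a debugfs knob: the kernel then reclaims
the instance automatically when its owner exits or crashes. A minimal
sketch of that idea (all perfmon_* names are made up for illustration;
anon_inode_getfd() is the standard helper for this):

#include <linux/anon_inodes.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/slab.h>

struct perfmon {
        u64 counters[64];       /* values accumulated on each dump */
};

/* Hypothetical helpers adding/removing the instance to/from the list
 * of active perfmons that every dump is fanned out to. */
static void perfmon_start(struct perfmon *mon) { }
static void perfmon_stop(struct perfmon *mon) { }

static int perfmon_release(struct inode *inode, struct file *file)
{
        struct perfmon *mon = file->private_data;

        perfmon_stop(mon);
        kfree(mon);
        return 0;
}

static const struct file_operations perfmon_fops = {
        .owner   = THIS_MODULE,
        .release = perfmon_release,
};

int perfmon_create(void)
{
        struct perfmon *mon = kzalloc(sizeof(*mon), GFP_KERNEL);

        if (!mon)
                return -ENOMEM;

        perfmon_start(mon);

        /*
         * ->release() runs when the last reference to the fd goes
         * away, even if the process crashed, so the perfmon cannot
         * outlive its creator.
         */
        return anon_inode_getfd("[perfmon]", &perfmon_fops, mon, O_RDONLY);
}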

I also had a quick look at the perf_event interface to see if we could
extend it to support monitoring GPU events. I might be wrong, as I
didn't spend much time investigating how it works, but it seems that
perf counters are saved/dumped/restored at each thread context switch,
which is not what we want here (it might add extra perfcnt dump
points, thus impacting GPU performance more than we expect).
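
For reference, a perf PMU driver plugs in through struct pmu, and for
a task-bound event the ->add()/->del() callbacks run every time the
task is scheduled in/out; that's where those extra dump points would
come from. A sketch with hypothetical gpu_pmu_* names (struct pmu and
the callback signatures are the real kernel interface):

#include <linux/perf_event.h>

static int gpu_pmu_event_init(struct perf_event *event)
{
        /* Validate event->attr and reject what we can't count. */
        return 0;
}

static int gpu_pmu_add(struct perf_event *event, int flags)
{
        /*
         * Called when the event is scheduled onto a CPU, i.e. at
         * every context switch for a task-bound event.
         */
        return 0;
}

static void gpu_pmu_del(struct perf_event *event, int flags)
{
        /*
         * Called when the event is scheduled out; the counters would
         * have to be sampled here, adding a GPU dump point.
         */
}

static void gpu_pmu_read(struct perf_event *event)
{
        /* Called on read(); would trigger yet another GPU dump. */
}

static struct pmu gpu_pmu = {
        .event_init = gpu_pmu_event_init,
        .add        = gpu_pmu_add,
        .del        = gpu_pmu_del,
        .read       = gpu_pmu_read,
};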

So maybe the best option is a pseudo-generic ioctl-based interface to
expose those perf counters.
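
Something along these lines, with per-instance create/destroy plus a
get-values ioctl (the struct layouts and ioctl numbers below are
purely illustrative, not a proposed uAPI):

#include "drm.h"

struct drm_panfrost_perfmon_create {
        __u64 counters_ptr;     /* in: userspace array of counter IDs */
        __u32 counters_count;   /* in: number of entries in the array */
        __u32 id;               /* out: perfmon handle */
};

struct drm_panfrost_perfmon_destroy {
        __u32 id;               /* in: perfmon handle to destroy */
        __u32 pad;
};

struct drm_panfrost_perfmon_get_values {
        __u64 values_ptr;       /* in: buffer receiving one u64 per counter */
        __u32 id;               /* in: perfmon handle */
        __u32 pad;
};

/* Ioctl numbers picked arbitrarily for the sketch. */
#define DRM_PANFROST_PERFMON_CREATE     0x06
#define DRM_PANFROST_PERFMON_DESTROY    0x07
#define DRM_PANFROST_PERFMON_GET_VALUES 0x08

#define DRM_IOCTL_PANFROST_PERFMON_CREATE \
        DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_PERFMON_CREATE, \
                 struct drm_panfrost_perfmon_create)
#define DRM_IOCTL_PANFROST_PERFMON_DESTROY \
        DRM_IOW(DRM_COMMAND_BASE + DRM_PANFROST_PERFMON_DESTROY, \
                struct drm_panfrost_perfmon_destroy)
#define DRM_IOCTL_PANFROST_PERFMON_GET_VALUES \
        DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_PERFMON_GET_VALUES, \
                 struct drm_panfrost_perfmon_get_values)

Assuming the driver ties each perfmon to the DRM file it was created
on, it would also be released automatically when that fd is closed,
which addresses the leak concern raised above.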

