[PATCH 0/3] drm/panfrost: Expose HW counters to userspace

Steven Price steven.price at arm.com
Mon May 13 12:48:08 UTC 2019


On 12/05/2019 14:38, Boris Brezillon wrote:
> On Sat, 11 May 2019 15:32:20 -0700
> Alyssa Rosenzweig <alyssa at rosenzweig.io> wrote:
> 
>> Hi all,
>>
>> As Steven Price explained, the "GPU top" kbase approach is often more
>> useful and accurate than per-draw timing. 
>>
>> For a 3D game running inside a GPU-accelerated desktop, the game's
>> counters *should* include desktop overhead. This external overhead
>> does affect the game's performance, especially if the contexts are
>> competing for resources like memory bandwidth. An isolated sample is
>> easy to achieve by running only the app of interest; in ideal
>> conditions (zero-copy fullscreen), desktop interference is negligible.
>>
>> For driver developers, the system-wide measurements are preferable,
>> painting a complete system picture and avoiding disruptions. There is no
>> risk of confusion, as the driver developers understand how the counters
>> are exposed. Further, benchmarks rendering direct to a GBM surface are
>> available (glmark2-es2-drm), eliminating interference even with poor
>> desktop performance.
>>
>> For app developers, the confusion of multi-context interference is
>> unfortunate. Nevertheless, if enabling counters were to slow down an
>> app, the confusion could be worse. Consider second-order changes in the
>> app's performance characteristics due to slowdown: if techniques like
>> dynamic resolution scaling are employed, the counters' results can be
>> invalid. Likewise, even if counters captured under reduced
>> performance remain representative for purely graphical workloads,
>> complex apps with variable CPU overhead (e.g. from an FPS-dependent
>> physics engine) can further confound the counters. Low-overhead
>> system-wide measurements mitigate these concerns.
> 
> I'd just like to point out that dumping counters the way
> mali_kbase/gator does likely has an impact on performance (at least
> on some GPUs) because of the clean+invalidate-cache that happens
> before (or after, I don't remember) each dump. IIUC, if this cache is

The clean is required after the dump to ensure that the CPU can see the
dumped counters (otherwise they could be stuck in the GPU's cache). Note
that if you don't need to observe the values immediately, the clean can
be deferred: e.g. if each dump targets a different memory region, you
could perform a single clean after several dumps, as in the sketch
below.
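Roughly what I have in mind, as a minimal sketch (the perfcnt_ring
bookkeeping and the three gpu_*() helpers are hypothetical stand-ins,
not real driver code):

    #include <linux/types.h>

    /* Hypothetical stand-ins for the real register pokes. */
    void gpu_set_dump_address(dma_addr_t addr);
    void gpu_trigger_counter_sample(void);
    void gpu_clean_l2_caches(void);

    #define PERFCNT_RING_SIZE 8

    struct perfcnt_ring {
    	dma_addr_t region[PERFCNT_RING_SIZE]; /* GPU-visible buffers */
    	unsigned int head;                    /* next region to use */
    };

    static void perfcnt_dump_deferred_clean(struct perfcnt_ring *r)
    {
    	/* Point the GPU at a fresh region and trigger a sample. */
    	gpu_set_dump_address(r->region[r->head]);
    	gpu_trigger_counter_sample();

    	r->head = (r->head + 1) % PERFCNT_RING_SIZE;

    	/*
    	 * Clean the L2 only when the ring wraps: the CPU doesn't
    	 * need to observe the values until then, so one clean
    	 * covers PERFCNT_RING_SIZE dumps.
    	 */
    	if (r->head == 0)
    		gpu_clean_l2_caches();
    }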

> global and not a per-address-space thing (which would be possible if
> the cache lines were tagged with a specific address space), that means
> all jobs running when the clean+invalidate happens will see extra
> cache misses after each dump. Of course, that's not as invasive as the
> full serialization that happens with my solution, but it's not free
> either.

Cache cleans (at the L2 cache level) are global. There are L1 caches
(and TLBs) but they are not relevant to counter dumping.

In practice cache cleans are not that expensive most of the time. As you
note, mali_kbase doesn't bother to avoid them when doing counter capture.

It should also be unnecessary to "invalidate" - a clean should be
sufficient. The invalidate is only required if e.g. MMU page tables have
been modified as well. Using a clean on its own would reduce the effect
on currently running jobs. But I notice that mali_kbase doesn't appear
to use the "clean only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware
of that option being broken, though.
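From memory the command encoding is as below - a sketch to show the
distinction, with values as I remember them from mali_kbase (gpu_write()
and struct panfrost_device are the accessors panfrost already uses;
treat the numbers as illustrative):

    #define GPU_COMMAND                  0x030 /* GPU control: command */
    #define GPU_COMMAND_CLEAN_CACHES      0x07 /* clean L2, no inval */
    #define GPU_COMMAND_CLEAN_INV_CACHES  0x08 /* clean + invalidate */

    /*
     * After a counter dump a clean is enough for the CPU to observe
     * the values; invalidating as well evicts lines that running jobs
     * may still want, so prefer the clean-only command here.
     */
    static void perfcnt_post_dump_clean(struct panfrost_device *pfdev)
    {
    	gpu_write(pfdev, GPU_COMMAND, GPU_COMMAND_CLEAN_CACHES);
    }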

>>
>> As Rob Clark suggested, system-wide counters could be exposed via a
>> semi-standardized interface, perhaps within debugfs/sysfs. The interface
>> could not be completely standard, as the list of counters exposed varies
>> substantially by vendor and model. Nevertheless, the mechanics of
>> discovering, enabling, reading, and disabling counters can be
>> standardized, as can a small set of universally meaningful counters like
>> total GPU utilization. This would permit a vendor-independent GPU top
>> app as suggested, as is I believe currently possible with
>> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali)
>>
>> It looks like this discussion is dormant. Could we try to get this
>> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
>> to diagnose without access to the counters, so I'm eager for a mainline
>> solution to be implemented.
> 
> I spent a bit of time thinking about it and looking at different
> solutions.
> 
> debugfs/sysfs might not be the best solution, especially if we think
> about the multi-user case (several instances of a GPU perfmon tool
> running in parallel). If we want it to work properly we need a way to
> instantiate several perf monitors and let the driver add values to all
> active perfmons every time a dump happens (no matter who triggered the
> dump). That's exactly what mali_kbase/gator does BTW. That's achievable
> through debugfs if we consider exposing a knob to instantiate such
> perfmon instances, but that also means risking perfmon leaks if the
> user does not take care of killing the perfmon it created when it's
> done with it (or when it crashes). It might also prove hard to expose
> that to non-root users in a secure way.

Note that mali_kbase only has to maintain multiple sets of values
because it provides backwards compatibility to an old version of the
driver. You could maintain one set of counter values and simply ensure
they are big enough not to wrap (or handle wrapping, but that is ugly
and unlikely to happen). Very early versions of kbase simply exposed the
hardware functionality (dump counters and reset) and therefore faced the
issue that there could only ever be one user of counters at a time. This
was "fixed" by emulating the existence of multiple interfaces (vinstr -
"virtual" instrumentation). If I was redesigning it from scratch then
I'd definitely remove the "reset" part of the interface and leave it for
the clients to latch the first value (of each counter) and subtract that
from subsequent counter reads.
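Client-side that's just a couple of lines; a sketch (read_counter() is
a hypothetical stand-in for whatever read mechanism the final interface
provides):

    #include <stdint.h>

    /* Hypothetical: returns the current raw value of one counter. */
    extern uint64_t read_counter(unsigned int id);

    static uint64_t counter_base;

    /* Latch the counter's value at the start of the session... */
    void session_start(unsigned int id)
    {
    	counter_base = read_counter(id);
    }

    /* ...and report deltas against it; no kernel-side reset needed. */
    uint64_t session_delta(unsigned int id)
    {
    	return read_counter(id) - counter_base;
    }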

Sadly the hardware doesn't help here: some (most?) GPU revisions will
saturate the counters rather than wrapping. Hence the need to frequently
poll and reset the counters, even if you only want the 'start'/'end'
values.
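The kernel therefore ends up owning the accumulation, along these lines
(a sketch, assuming 32-bit hardware counters that saturate at 0xffffffff
and are cleared after each sample):

    #include <linux/types.h>
    #include <linux/printk.h>
    #include <linux/limits.h>

    /*
     * Fold a fresh 32-bit hardware sample into a 64-bit running total.
     * The hardware counter is cleared after every sample, so each dump
     * holds only the delta since the previous one - unless it reached
     * U32_MAX, in which case events were lost to saturation and the
     * poll interval needs to shrink.
     */
    static void perfcnt_accumulate(u64 *total, u32 sample)
    {
    	if (sample == U32_MAX)
    		pr_warn_once("perfcnt saturated; polling too slowly\n");

    	*total += sample;
    }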

Having only one set of counters should also reduce the potential for
"leaks" - although you still want to ensure that collection stops when
nobody cares about the counters any more...

Steve

> I also had a quick look at the perf_event interface to see if we could
> extend it to support monitoring GPU events. I might be wrong as I
> didn't spend much time investigating how it works, but it seems that
> perf counters are saved/dumped/restored at each thread context switch,
> which is not what we want here (it might add extra perfcnt dump points,
> thus impacting GPU performance more than we expect).
> 
> So maybe the best option is a pseudo-generic ioctl-based interface to
> expose those perf counters.
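
For illustration, the shape of such an ioctl interface can stay quite
small. A hypothetical sketch (not the uAPI proposed in this series;
names and numbers are illustrative only):

    #include <drm/drm.h>

    /* Hypothetical uAPI: one ioctl to start/stop collection, one to
     * dump the current values into a userspace buffer. */
    struct drm_panfrost_perfcnt_enable {
    	__u32 enable;     /* 1 = start collecting, 0 = stop */
    	__u32 counterset; /* which set of counters to expose */
    };

    struct drm_panfrost_perfcnt_dump {
    	__u64 buf_ptr;    /* pointer to the buffer receiving values */
    };

    #define DRM_PANFROST_PERFCNT_ENABLE 0x06
    #define DRM_PANFROST_PERFCNT_DUMP   0x07

    #define DRM_IOCTL_PANFROST_PERFCNT_ENABLE \
    	DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_PERFCNT_ENABLE, \
    		 struct drm_panfrost_perfcnt_enable)
    #define DRM_IOCTL_PANFROST_PERFCNT_DUMP \
    	DRM_IOWR(DRM_COMMAND_BASE + DRM_PANFROST_PERFCNT_DUMP, \
    		 struct drm_panfrost_perfcnt_dump)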