[PATCH 0/3] drm/panfrost: Expose HW counters to userspace

Mon May 13 14:13:38 UTC 2019

On 13/05/2019 14:39, Boris Brezillon wrote:
> On Mon, 13 May 2019 13:48:08 +0100
> Steven Price <steven.price at arm.com> wrote:
> 
>> On 12/05/2019 14:38, Boris Brezillon wrote:
>>> On Sat, 11 May 2019 15:32:20 -0700
>>> Alyssa Rosenzweig <alyssa at rosenzweig.io> wrote:
>>>   
>>>> Hi all,
>>>>
>>>> As Steven Price explained, the "GPU top" kbase approach is often more
>>>> useful and accurate than per-draw timing. 
>>>>
>>>> For a 3D game inside a GPU-accelerated desktop, the games' counters
>>>> *should* include desktop overhead. This external overhead does affect
>>>> the game's performance, especially if the contexts are competing for
>>>> resources like memory bandwidth. An isolated sample is easy to achieve
>>>> running only the app of interest; in ideal conditions (zero-copy
>>>> fullscreen), desktop interference is negligible. 
>>>>
>>>> For driver developers, the system-wide measurements are preferable,
>>>> painting a complete system picture and avoiding disruptions. There is no
>>>> risk of confusion, as the driver developers understand how the counters
>>>> are exposed. Further, benchmarks rendering direct to a GBM surface are
>>>> available (glmark2-es2-drm), eliminating interference even with poor
>>>> desktop performance.
>>>>
>>>> For app developers, the confusion of multi-context interference is
>>>> unfortunate. Nevertheless, if enabling counters were to slow down an
>>>> app, the confusion could be worse. Consider second-order changes in the
>>>> app's performance characteristics due to slowdown: if techniques like
>>>> dynamic resolution scaling are employed, the counters' results can be
>>>> invalid.  Likewise, even if the lower-performance counters are
>>>> appropriate for purely graphical workloads, complex apps with variable
>>>> CPU overhead (e.g. from an FPS-dependent physics engine) can further
>>>> confound counters. Low-overhead system-wide measurements mitigate these
>>>> concerns.  
>>>
>>> I'd just like to point out that dumping counters the way
>>> mali_kbase/gator does likely has an impact on perfs (at least on some
>>> GPUs) because of the clean+invalidate-cache that happens before (or
>>> after, I don't remember) each dump. IIUC and this cache is actually  
>>
>> The clean is required after the dump to ensure that the CPU can see the
>> dumped counters (otherwise they could be stuck in the GPU's cache).
> 
> Just to make it clear, I was not questioning the need for a cache clean
> here, I was simply pointing out that dumping counters is not free even
> when we don't add those serialization points. Note that dumping
> counters also has an impact on the cache-miss counter of future dumps,
> but there's nothing we can do about that no matter the solution we go
> for.

Indeed, counters go through the caches, so clearly will have an impact
on the cache performance. But usually the impact is sufficiently "in the
noise" that it doesn't matter. But this is a hardware limitation.

> [..]
>>
>> Sadly the hardware doesn't help here and some (most?) GPU revisions will
>> saturate the counters rather than wrapping. Hence the need to frequently
>> poll and reset the counters, even if you only want the 'start'/'end' values.
> 
> Not sure wrapping would be better, as you can't tell how many times
> you've reached the max val.
> 
> BTW, counters are reset every time we do a dump, right? If that's the
> case, there's no risk to have end_val < start_val and the only risk is
> to get inaccurate values when counter_val > U32_MAX, but reaching
> U32_MAX already means that this event occurred a lot during the
> monitoring period.

Yes the h/w resets the counters when dumping. What I intended to say
(and re-reading it didn't do a good job) was that it would be quite
helpful if the hardware didn't reset and didn't saturate. That way
multiple uses could simply observe the counter values. The combination
of saturating and resetting on read means that there has to only be done
entity reading the counter values from the hardware and summing up.

I *think* later hardware might have dropped the saturating logic
(because it costs gates and is fairly useless for software), but sadly
because of backwards compatibility the resetting on read behaviour isn't
going to change.

>>
>> Having only one set of counters might simplify the potential for "leaks"
>> - although you still want to ensure that if nobody cares you stop
>> collecting counters...
> 
> Handling that with a debugfs-based iface implies implementing some
> heuristic like 'counters haven't been dumped for a long time, let's
> stop monitoring them', and I'm not a big fan of such things.

Agreed, it's fairly ugly. If you expose "native" sized counters (i.e.
u32) then there's a reasonable heuristic: if (1^32/GPU clock frequency)
has passed then the counters might have overflowed so are unlikely to be
useful to anyone still looking at them.

> This being said, I think I'll go for a simple debugfs-based iface to
> unblock Alyssa. debugfs is not part of the stable-ABI and I guess we
> can agree on explicitly marking it as "unstable" so that when we settle
> on a generic interface to expose such counters we can get rid of the
> old one.

A generic interface would be good, but short-term I can definitely see
the value of exporting the counters in debugfs.

Steve