[PATCH 0/3] drm/panfrost: Expose HW counters to userspace

Eric Anholt eric at anholt.net
Wed May 1 17:12:42 UTC 2019


Boris Brezillon <boris.brezillon at collabora.com> writes:

> +Rob, Eric, Mark and more
>
> Hi,
>
> On Fri, 5 Apr 2019 16:20:45 +0100
> Steven Price <steven.price at arm.com> wrote:
>
>> On 04/04/2019 16:20, Boris Brezillon wrote:
>> > Hello,
>> > 
>> > This patch adds new ioctls to expose GPU counters to userspace.
>> > These will be used by the mesa driver (should be posted soon).
>> > 
>> > A few words about the implementation: I followed the VC4/Etnaviv model
>> > where perf counters are retrieved on a per-job basis. This allows one
>> > to get accurate results when several users are using the GPU
>> > concurrently.
>> > AFAICT, mali_kbase uses a different approach where several users can
>> > register a performance monitor, but with no way to have fine-grained
>> > control over which job/GPU-context to track.
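For reference, the per-job model being followed here is the one already exposed by the VC4 uAPI: userspace creates a perfmon, attaches its id to a job submission, and reads the values back once that job has completed. A rough userspace sketch, using VC4's ioctl and field names (the panfrost ioctls in this series will have their own; error handling omitted):

#include <stdint.h>
#include <xf86drm.h>
#include <drm/vc4_drm.h>	/* kernel uAPI header */

static void sample_counters_for_one_job(int fd, struct drm_vc4_submit_cl *submit)
{
	struct drm_vc4_perfmon_create create = {
		.ncounters = 2,
		.events = { 0, 1 },	/* hypothetical HW event indices */
	};
	struct drm_vc4_perfmon_get_values get;
	uint64_t values[2];

	drmIoctl(fd, DRM_IOCTL_VC4_PERFMON_CREATE, &create);

	/* Counters only accumulate while this particular job is on the GPU. */
	submit->perfmonid = create.id;
	drmIoctl(fd, DRM_IOCTL_VC4_SUBMIT_CL, submit);

	/* ... wait for the job to complete ... */

	get.id = create.id;
	get.values_ptr = (uintptr_t)values;
	drmIoctl(fd, DRM_IOCTL_VC4_PERFMON_GET_VALUES, &get);
}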
>> 
>> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
>> be from different contexts (address spaces), and mali_kbase also fully
>> uses the _NEXT registers. So there can be a job from one context
>> executing on slot 0 and a job from a different context waiting in the
>> _NEXT registers. (And the same for slot 1). This means that there's no
>> (visible) gap between the first job finishing and the second job
>> starting. Early versions of the driver even had a throttle to avoid
>> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
>> IRQ - but thankfully that's gone.
>> 
>> The upshot is that it's basically impossible to measure "per-job"
>> counters when running at full speed, because multiple jobs are running
>> and the driver doesn't actually know when one ends and the next starts.
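To make the overlap concrete, here is a simplified sketch of what queuing on a job slot looks like on the kernel side, using the register names from the panfrost driver (config/affinity setup and locking omitted): while one job runs from the HEAD registers, the next is already latched in the _NEXT registers, so there is never an idle gap in which counters could safely be dumped.

static void queue_next_job(struct panfrost_device *pfdev,
			   struct panfrost_job *job, int slot)
{
	u64 jc_head = job->jc;	/* GPU address of the job chain */

	job_write(pfdev, JS_HEAD_NEXT_LO(slot), lower_32_bits(jc_head));
	job_write(pfdev, JS_HEAD_NEXT_HI(slot), upper_32_bits(jc_head));

	/* The HW switches to this job the instant the current one retires. */
	job_write(pfdev, JS_COMMAND_NEXT(slot), JS_COMMAND_START);
}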
>> 
>> Since one of the primary use cases is to draw pretty graphs of the
>> system load [1], this "per-job" information isn't all that relevant (and
>> minimal performance overhead is important). And if you want to monitor
>> just one application it is usually easiest to ensure that it is the only
>> thing running.
>> 
>> [1]
>> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
>> 
>> > This design choice comes at a cost: every time the perfmon context
>> > changes (the perfmon context is the list of currently active
>> > perfmons), the driver has to add a fence to prevent new jobs from
>> > corrupting counters that will be dumped by previous jobs.
>> > 
>> > Let me know if that's an issue and if you think we should approach
>> > things differently.  
>> 
>> It depends what you expect to do with the counters. Per-job counters are
>> certainly useful sometimes. But serialising all jobs can mess up the
>> thing you are trying to measure the performance of.
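Roughly, the fencing described in the cover letter amounts to something like the sketch below: jobs that share the same perfmon set can overlap freely, but when the set changes the new job is made to wait behind the previous ones so their counters can be dumped first. This is illustrative only; perfmon_ctx and last_perfcnt_fence are placeholder names rather than the actual API in this series, and drm_sched_job_add_dependency() is the generic scheduler helper for recording such a dependency.

static int perfmon_add_dependency(struct panfrost_device *pfdev,
				  struct panfrost_job *job)
{
	struct dma_fence *fence;

	/* Same counter set as the jobs already queued: no serialisation. */
	if (job->perfmon_ctx == pfdev->current_perfmon_ctx)
		return 0;

	fence = dma_fence_get(pfdev->last_perfcnt_fence);
	if (!fence)
		return 0;

	/* Make the new job wait for the previous perfmon context to drain. */
	return drm_sched_job_add_dependency(&job->base, fence);
}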
>
> I finally found some time to work on v2 this morning, and it turns out
> implementing global perf monitors as done in mali_kbase means rewriting
> almost everything (apart from the perfcnt layout stuff). I'm not against
> doing that, but I'd like to be sure this is really what we want.
>
> Eric, Rob, any opinion on that? Is it acceptable to expose counters
> through the pipe_query/AMD_perfmon interface if we don't have this
> job (or at least draw call) granularity? If not, should we keep the
> solution I'm proposing here to make sure counter values are accurate,
> or should we expose perf counters through a non-standard API?

You should definitely not count perf results from someone else's context
against your own!  People doing perf analysis will expect slight
performance changes (like missing bin/render parallelism between
contexts) when doing perf queries, but they will be absolutely lost if
their non-texturing job starts showing texturing results from some
unrelated context.
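
For context, this is the kind of userspace usage the pipe_query/AMD_perfmon path serves. The application brackets only its own work, so any counter activity leaking in from another context's jobs makes the result meaningless. A trimmed-down GL_AMD_performance_monitor example (group/counter ids would normally come from glGetPerfMonitorGroupsAMD()/glGetPerfMonitorCountersAMD()):

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

static void profile_my_draws(void)
{
	GLuint monitor, group = 0, counter = 0;	/* illustrative ids */
	GLuint result[16];
	GLint written;

	glGenPerfMonitorsAMD(1, &monitor);
	glSelectPerfMonitorCountersAMD(monitor, GL_TRUE, group, 1, &counter);

	glBeginPerfMonitorAMD(monitor);
	/* ... draw calls issued by this context only ... */
	glEndPerfMonitorAMD(monitor);

	glGetPerfMonitorCounterDataAMD(monitor, GL_PERFMON_RESULT_AMD,
				       sizeof(result), result, &written);
}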