[PATCH 0/3] drm/panfrost: Expose HW counters to userspace

Sun May 12 13:17:10 UTC 2019

On Wed, 01 May 2019 10:12:42 -0700
Eric Anholt <eric at anholt.net> wrote:

> Boris Brezillon <boris.brezillon at collabora.com> writes:
> 
> > +Rob, Eric, Mark and more
> >
> > Hi,
> >
> > On Fri, 5 Apr 2019 16:20:45 +0100
> > Steven Price <steven.price at arm.com> wrote:
> >  
> >> On 04/04/2019 16:20, Boris Brezillon wrote:  
> >> > Hello,
> >> > 
> >> > This patch adds new ioctls to expose GPU counters to userspace.
> >> > These will be used by the mesa driver (should be posted soon).
> >> > 
> >> > A few words about the implementation: I followed the VC4/Etnaviv model
> >> > where perf counters are retrieved on a per-job basis. This allows one
> > >> > to get accurate results when there are users using the GPU
> >> > concurrently.
> >> > AFAICT, the mali kbase is using a different approach where several
> > >> > users can register a performance monitor but with no way to have
> > >> > fine-grained control over what job/GPU-context to track.
> >> 
> >> mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> >> be from different contexts (address spaces), and mali_kbase also fully
> >> uses the _NEXT registers. So there can be a job from one context
> >> executing on slot 0 and a job from a different context waiting in the
> >> _NEXT registers. (And the same for slot 1). This means that there's no
> >> (visible) gap between the first job finishing and the second job
> >> starting. Early versions of the driver even had a throttle to avoid
> >> interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> >> IRQ - but thankfully that's gone.
> >> 
> >> The upshot is that it's basically impossible to measure "per-job"
> >> counters when running at full speed. Because multiple jobs are running
> >> and the driver doesn't actually know when one ends and the next starts.
> >> 
> >> Since one of the primary use cases is to draw pretty graphs of the
> >> system load [1], this "per-job" information isn't all that relevant (and
> >> minimal performance overhead is important). And if you want to monitor
> >> just one application it is usually easiest to ensure that it is the only
> >> thing running.
> >> 
> >> [1]
> >> https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> >>   
> >> > This design choice comes at a cost: every time the perfmon context
> >> > changes (the perfmon context is the list of currently active
> >> > perfmons), the driver has to add a fence to prevent new jobs from
> >> > corrupting counters that will be dumped by previous jobs.
> >> > 
> >> > Let me know if that's an issue and if you think we should approach
> >> > things differently.    
> >> 
> >> It depends what you expect to do with the counters. Per-job counters are
> >> certainly useful sometimes. But serialising all jobs can mess up the
> >> thing you are trying to measure the performance of.  
> >
> > I finally found some time to work on v2 this morning, and it turns out
> > implementing global perf monitors as done in mali_kbase means rewriting
> > almost everything (apart from the perfcnt layout stuff). I'm not against
> > doing that, but I'd like to be sure this is really what we want.
> >
> > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > through the pipe_query/AMD_perfmon interface if we don't have this
> > job (or at least draw call) granularity? If not, should we keep the
> > > solution I'm proposing here to make sure counter values are accurate,
> > or should we expose perf counters through a non-standard API?  
> 
> You should definitely not count perf results from someone else's context
> against your own!

I also had the feeling that doing that was a bad idea (which you and Rob
confirmed), but I listed it here to clear up my doubts.

> People doing perf analysis will expect slight
> performance changes (like missing bin/render parallelism between
> contexts) when doing perf queries, but they will be absolutely lost if
> their non-texturing job starts showing texturing results from some
> unrelated context.

Given the feedback I've had so far, it looks like defining a new
interface for this type of 'global perfmon' is the way to go if we want
to make everyone happy. Also note that the work I've done on VC4 has the
same limitation: a perfmon context change between 2 jobs will introduce
a serialization point. It seems that you're not as worried as Steven or
Rob are about the extra cost of this serialization, but I wanted to
point that out.
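
To make the serialization cost concrete, here is a minimal sketch of the
rule described above: a job has to wait for everything queued before it
whenever its set of perfmons differs from the previous job's. This is a
toy model, not the panfrost or VC4 code; `struct job`, `perfmon_mask`,
`needs_serialization()` and `count_serialization_points()` are
hypothetical names chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: each job carries a bitmask of the perfmons
 * attached to it. A mask of 0 means "no perfmon attached".
 */
struct job {
	uint32_t perfmon_mask;
};

/*
 * The perfmon context changes when two consecutive jobs do not use the
 * same set of perfmons. In that case the second job must not start
 * before the first one has finished, otherwise it would corrupt the
 * counters the first job is about to dump.
 */
static bool needs_serialization(const struct job *prev,
				const struct job *next)
{
	if (!prev)
		return false;	/* first job in the queue, nothing to wait for */

	return prev->perfmon_mask != next->perfmon_mask;
}

/* Count how many serialization points a queue of n jobs would contain. */
static size_t count_serialization_points(const struct job *jobs, size_t n)
{
	size_t i, points = 0;

	assert(jobs || n == 0);

	for (i = 1; i < n; i++) {
		if (needs_serialization(&jobs[i - 1], &jobs[i]))
			points++;
	}

	return points;
}
```

In this model, jobs that share the same perfmon set (including the
"no perfmon" case) can still be queued back to back, so the cost is only
paid on transitions between different perfmon sets.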