[PATCH 0/3] drm/panfrost: Expose HW counters to userspace

Boris Brezillon boris.brezillon at collabora.com
Sun May 12 13:40:26 UTC 2019


On Tue, 30 Apr 2019 09:49:51 -0600
Jordan Crouse <jcrouse at codeaurora.org> wrote:

> On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote:
> > On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon
> > <boris.brezillon at collabora.com> wrote:  
> > >
> > > +Rob, Eric, Mark and more
> > >
> > > Hi,
> > >
> > > On Fri, 5 Apr 2019 16:20:45 +0100
> > > Steven Price <steven.price at arm.com> wrote:
> > >  
> > > > On 04/04/2019 16:20, Boris Brezillon wrote:  
> > > > > Hello,
> > > > >
> > > > > This patch adds new ioctls to expose GPU counters to userspace.
> > > > > These will be used by the mesa driver (should be posted soon).
> > > > >
> > > > > A few words about the implementation: I followed the VC4/Etnaviv model
> > > > > where perf counters are retrieved on a per-job basis. This allows one
> > > > > to get accurate results even when several users are using the GPU
> > > > > concurrently.
> > > > > AFAICT, mali_kbase uses a different approach where several users can
> > > > > register a performance monitor, but with no fine-grained control over
> > > > > which job/GPU-context to track.
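> > > > >
> > > > > To give a rough idea of the uapi this leads to, here is a sketch
> > > > > (struct/field names are illustrative, modeled on VC4's perfmon
> > > > > objects, not necessarily what the patches use):
> > > > >
> > > > > struct drm_panfrost_perfmon_create {
> > > > > 	__u32 id;      /* out: perfmon handle */
> > > > > 	__u32 counter; /* in: HW counter to monitor */
> > > > > };
> > > > >
> > > > > /* The perfmon handle is then passed in the submit ioctl, and the
> > > > >  * accumulated value read back once the job is done:
> > > > >  */
> > > > > struct drm_panfrost_perfmon_get_value {
> > > > > 	__u32 id;      /* in: perfmon handle */
> > > > > 	__u32 padding;
> > > > > 	__u64 value;   /* out: value accumulated by the attached jobs */
> > > > > };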
> > > >
> > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> > > > be from different contexts (address spaces), and mali_kbase also fully
> > > > uses the _NEXT registers. So there can be a job from one context
> > > > executing on slot 0 and a job from a different context waiting in the
> > > > _NEXT registers. (And the same for slot 1). This means that there's no
> > > > (visible) gap between the first job finishing and the second job
> > > > starting. Early versions of the driver even had a throttle to avoid
> > > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> > > > IRQ - but thankfully that's gone.
> > > >
> > > > The upshot is that it's basically impossible to measure "per-job"
> > > > counters when running at full speed, because multiple jobs are running
> > > > and the driver doesn't actually know when one ends and the next starts.
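> > > >
> > > > To illustrate with pseudo-code (register names only approximate
> > > > the real ones):
> > > >
> > > > /* While job A runs on slot 0, job B is already loaded in the
> > > >  * _NEXT registers, so B starts the instant A finishes and the
> > > >  * driver only hears about the transition from the job IRQ.
> > > >  */
> > > > writel(job_a, JS0_HEAD_NEXT_LO);
> > > > writel(JS_COMMAND_START, JS0_COMMAND_NEXT);  /* A starts */
> > > > writel(job_b, JS0_HEAD_NEXT_LO);
> > > > writel(JS_COMMAND_START, JS0_COMMAND_NEXT);  /* B queued behind A */
> > > > /* ...by the time the IRQ for A is handled, B may have been
> > > >  * running for a while, so a counter dump here mixes A and B. */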
> > > >
> > > > Since one of the primary use cases is to draw pretty graphs of the
> > > > system load [1], this "per-job" information isn't all that relevant (and
> > > > minimal performance overhead is important). And if you want to monitor
> > > > just one application it is usually easiest to ensure that it is the only
> > > > thing running.
> > > >
> > > > [1]
> > > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> > > >  
> > > > > This design choice comes at a cost: every time the perfmon context
> > > > > changes (the perfmon context is the list of currently active
> > > > > perfmons), the driver has to add a fence to prevent new jobs from
> > > > > corrupting counters that will be dumped by previous jobs.
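> > > > >
> > > > > Roughly, at submit time (hand-wavy sketch, helper names made up):
> > > > >
> > > > > /* If the set of active perfmons changed since the last job,
> > > > >  * wait for that job's done fence so the new job can't trample
> > > > >  * counters the previous context still has to dump.
> > > > >  */
> > > > > if (!perfmon_ctx_equal(job->perfmon_ctx, pfdev->cur_perfmon_ctx)) {
> > > > > 	job_add_dependency(job, pfdev->last_job_done_fence);
> > > > > 	pfdev->cur_perfmon_ctx = job->perfmon_ctx;
> > > > > }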
> > > > >
> > > > > Let me know if that's an issue and if you think we should approach
> > > > > things differently.  
> > > >
> > > > It depends what you expect to do with the counters. Per-job counters are
> > > > certainly useful sometimes. But serialising all jobs can mess up the
> > > > thing you are trying to measure the performance of.  
> > >
> > > I finally found some time to work on v2 this morning, and it turns out
> > > implementing global perf monitors as done in mali_kbase means rewriting
> > > almost everything (apart from the perfcnt layout stuff). I'm not against
> > > doing that, but I'd like to be sure this is really what we want.
> > >
> > > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > > through the pipe_query/AMD_perfmon interface if we don't have this
> > > job (or at least draw call) granularity? If not, should we keep the
> > > solution I'm proposing here to make sure counter values are accurate,
> > > or should we expose perf counters through a non-standard API?  
> > 
> > I think if you can't do per-draw granularity, then you should not
> > try to implement AMD_perfmon.. instead the use case is more for a
> > sort of "gpu top" app (as opposed to something like frameretrace,
> > which takes per-draw-call level measurements from within the app).
> > Things that use AMD_perfmon are going to, I think, expect to query
> > values between individual glDraw calls, and you probably don't want
> > to flush tile passes 500 times per frame.
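> >
> > To make that concrete, the pattern an AMD_perfmon user expects
> > (`group' and `counter' stand in for IDs previously enumerated with
> > glGetPerfMonitorGroupsAMD()/glGetPerfMonitorCountersAMD()):
> >
> >   GLuint mon;
> >   glGenPerfMonitorsAMD(1, &mon);
> >   glSelectPerfMonitorCountersAMD(mon, GL_TRUE, group, 1, &counter);
> >
> >   glBeginPerfMonitorAMD(mon);
> >   glDrawArrays(GL_TRIANGLES, 0, 3);   /* a single draw */
> >   glEndPerfMonitorAMD(mon);
> >
> >   /* real code polls GL_PERFMON_RESULT_AVAILABLE_AMD first */
> >   GLuint data[4];                     /* (group, counter, value) */
> >   GLint written;
> >   glGetPerfMonitorCounterDataAMD(mon, GL_PERFMON_RESULT_AMD,
> >                                  sizeof(data), data, &written);
> >
> > ..so with Begin/End around every draw, a tiler would have to flush
> > the tile pass each time.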
> > 
> > (Although, I suppose if there are multiple GPUs where perfcntrs work
> > this way, it might be an interesting exercise to think about coming up
> > w/ a standardized API (debugfs perhaps?) to monitor the counters.. so
> > you could have a single userspace tool that works across several
> > different drivers.)  
> 
> I agree. We've been pondering a lot of the same issues for Adreno.

So, you also have those global perf counters that can't be retrieved
through the cmdstream? After the discussion I had with Rob, I had the
impression freedreno was supporting the AMD_perfmon interface properly.

