[PATCH 0/3] drm/panfrost: Expose HW counters to userspace

Mon May 13 15:00:08 UTC 2019

On Sun, May 12, 2019 at 03:40:26PM +0200, Boris Brezillon wrote:
> On Tue, 30 Apr 2019 09:49:51 -0600
> Jordan Crouse <jcrouse at codeaurora.org> wrote:
> 
> > On Tue, Apr 30, 2019 at 06:10:53AM -0700, Rob Clark wrote:
> > > On Tue, Apr 30, 2019 at 5:42 AM Boris Brezillon
> > > <boris.brezillon at collabora.com> wrote:  
> > > >
> > > > +Rob, Eric, Mark and more
> > > >
> > > > Hi,
> > > >
> > > > On Fri, 5 Apr 2019 16:20:45 +0100
> > > > Steven Price <steven.price at arm.com> wrote:
> > > >  
> > > > > On 04/04/2019 16:20, Boris Brezillon wrote:  
> > > > > > Hello,
> > > > > >
> > > > > > This patch adds new ioctls to expose GPU counters to userspace.
> > > > > > These will be used by the mesa driver (should be posted soon).
> > > > > >
> > > > > > A few words about the implementation: I followed the VC4/Etnaviv model
> > > > > > where perf counters are retrieved on a per-job basis. This allows one
> > > > > > > to get accurate results when there are users using the GPU
> > > > > > concurrently.
> > > > > > AFAICT, the mali kbase is using a different approach where several
> > > > > > > users can register a performance monitor but with no way to have
> > > > > > > fine-grained control over what job/GPU-context to track.
> > > > >
> > > > > mali_kbase submits overlapping jobs. The jobs on slot 0 and slot 1 can
> > > > > be from different contexts (address spaces), and mali_kbase also fully
> > > > > uses the _NEXT registers. So there can be a job from one context
> > > > > executing on slot 0 and a job from a different context waiting in the
> > > > > _NEXT registers. (And the same for slot 1). This means that there's no
> > > > > (visible) gap between the first job finishing and the second job
> > > > > starting. Early versions of the driver even had a throttle to avoid
> > > > > interrupt storms (see JOB_IRQ_THROTTLE) which would further delay the
> > > > > IRQ - but thankfully that's gone.
> > > > >
> > > > > The upshot is that it's basically impossible to measure "per-job"
> > > > > counters when running at full speed. Because multiple jobs are running
> > > > > and the driver doesn't actually know when one ends and the next starts.
> > > > >
> > > > > Since one of the primary use cases is to draw pretty graphs of the
> > > > > system load [1], this "per-job" information isn't all that relevant (and
> > > > > minimal performance overhead is important). And if you want to monitor
> > > > > just one application it is usually easiest to ensure that it is the only
> > > > > thing running.
> > > > >
> > > > > [1]
> > > > > https://developer.arm.com/tools-and-software/embedded/arm-development-studio/components/streamline-performance-analyzer
> > > > >  
> > > > > > This design choice comes at a cost: every time the perfmon context
> > > > > > changes (the perfmon context is the list of currently active
> > > > > > perfmons), the driver has to add a fence to prevent new jobs from
> > > > > > corrupting counters that will be dumped by previous jobs.
> > > > > >
> > > > > > Let me know if that's an issue and if you think we should approach
> > > > > > things differently.  
> > > > >
> > > > > It depends what you expect to do with the counters. Per-job counters are
> > > > > certainly useful sometimes. But serialising all jobs can mess up the
> > > > > thing you are trying to measure the performance of.  
> > > >
> > > > I finally found some time to work on v2 this morning, and it turns out
> > > > implementing global perf monitors as done in mali_kbase means rewriting
> > > > almost everything (apart from the perfcnt layout stuff). I'm not against
> > > > doing that, but I'd like to be sure this is really what we want.
> > > >
> > > > Eric, Rob, any opinion on that? Is it acceptable to expose counters
> > > > through the pipe_query/AMD_perfmon interface if we don't have this
> > > > job (or at least draw call) granularity? If not, should we keep the
> > > > solution I'm proposing here to make sure counter values are accurate,
> > > > or should we expose perf counters through a non-standard API?  
> > > 
> > > I think if you can't do per-draw level granularity, then you should
> > > not try to implement AMD_perfmon..  instead the use case is more for a
> > > > sort of "gpu top" app (as opposed to something like frameretrace, which
> > > > is taking per-draw-call level measurements from within the app).
> > > Things that use AMD_perfmon are going to, I think, expect to query
> > > values between individual glDraw calls, and you probably don't want to
> > > flush tile passes 500 times per frame.
> > > 
> > > (Although, I suppose if there are multiple GPUs where perfcntrs work
> > > this way, it might be an interesting exercise to think about coming up
> > > w/ a standardized API (debugfs perhaps?) to monitor the counters.. so
> > > you could have a single userspace tool that works across several
> > > different drivers.)  
> > 
> > I agree. We've been pondering a lot of the same issues for Adreno.
> 
> So, you also have those global perf counters that can't be retrieved
> through the cmdstream? After the discussion I had with Rob I had the
> impression freedreno was supporting the AMD_perfmon interface in a
> proper way.

It is true that we can read most of the counters from the cmdstream but we have
a limited number of global physical counters that have to be shared across
contexts and a metric ton of countables that can be selected. As long as we
have just the one process running AMD_perfmon then we are fine, but it breaks
down after that.

I am also interested in the 'gpu top' use case, which can be somewhat emulated
through a cmdstream assuming that we stick to a finite set of counters and
countables that won't conflict with extensions. This is fine for basic work but
anybody doing any serious performance work would run into some limitations
quickly.

And finally the kernel uses a few counters for its own purposes and I hate the
"gentlemen's agreement" between the user and the kernel as to which physical
counters each gets ownership of. It would be a lot better to turn it into an
explicit dynamic reservation, which I feel would be a natural offshoot of a
generic interface.
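
To make the "explicit dynamic reservation" idea concrete, here is a minimal
sketch of what such bookkeeping could look like. Everything in it is
hypothetical and for illustration only: the pool size, the owner ids, and the
names counter_pool, pool_reserve, pool_release and pool_release_all do not
come from any existing driver, and a real implementation would also need
locking and a userspace-facing interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PHYS_COUNTERS 8u  /* hypothetical; real hardware differs */
#define OWNER_NONE        0u  /* slot is free */
#define OWNER_KERNEL      1u  /* hypothetical id for the kernel's own use */

/* One entry per physical counter: who holds it and what it counts. */
struct counter_pool {
	uint32_t owner[NUM_PHYS_COUNTERS];
	uint32_t countable[NUM_PHYS_COUNTERS];
};

static void pool_init(struct counter_pool *p)
{
	assert(p != NULL);
	for (size_t i = 0; i < NUM_PHYS_COUNTERS; i++) {
		p->owner[i] = OWNER_NONE;
		p->countable[i] = 0;
	}
}

/* Reserve the first free physical counter; returns its index or -1. */
static int pool_reserve(struct counter_pool *p, uint32_t owner,
			uint32_t countable)
{
	assert(p != NULL && owner != OWNER_NONE);
	for (size_t i = 0; i < NUM_PHYS_COUNTERS; i++) {
		if (p->owner[i] == OWNER_NONE) {
			p->owner[i] = owner;
			p->countable[i] = countable;
			return (int)i;
		}
	}
	return -1; /* every physical counter is already reserved */
}

/* Release one counter, but only if the caller actually owns it. */
static int pool_release(struct counter_pool *p, uint32_t owner, int idx)
{
	assert(p != NULL);
	if (idx < 0 || (uint32_t)idx >= NUM_PHYS_COUNTERS)
		return -1;
	if (owner == OWNER_NONE || p->owner[idx] != owner)
		return -1;
	p->owner[idx] = OWNER_NONE;
	return 0;
}

/* Drop everything an owner holds, e.g. when its context goes away. */
static int pool_release_all(struct counter_pool *p, uint32_t owner)
{
	int released = 0;

	assert(p != NULL);
	if (owner == OWNER_NONE)
		return 0;
	for (size_t i = 0; i < NUM_PHYS_COUNTERS; i++) {
		if (p->owner[i] == owner) {
			p->owner[i] = OWNER_NONE;
			released++;
		}
	}
	return released;
}
```

The point of the sketch is that the kernel would go through the same
reservation path as everyone else (here via OWNER_KERNEL), so ownership is
recorded rather than assumed, and a failed reservation is reported back
instead of silently clobbering someone else's counter.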

Jordan
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project