[RESEND PATCH v2 2/2] drm/panfrost: Expose perf counters through debugfs

Wed May 29 06:55:18 UTC 2019

On Tue, 28 May 2019 16:01:15 +0200
Boris Brezillon <boris.brezillon at collabora.com> wrote:

> On Tue, 28 May 2019 15:53:48 +0200
> Boris Brezillon <boris.brezillon at collabora.com> wrote:
> 
> > Hi Steve,
> > 
> > On Thu, 16 May 2019 17:21:39 +0100
> > Steven Price <steven.price at arm.com> wrote:
> > 
> >   
> > > >> +static void panfrost_perfcnt_setup(struct panfrost_device *pfdev)
> > > >> +{
> > > >> +       u32 cfg;
> > > >> +
> > > >> +       /*
> > > >> +        * Always use address space 0 for now.
> > > >> +        * FIXME: this needs to be updated when we start using different
> > > >> +        * address space.
> > > >> +        */
> > > >> +       cfg = GPU_PERFCNT_CFG_AS(0);
> > > >> +       if (panfrost_model_cmp(pfdev, 0x1000) >= 0)      
> > > > 
> > > > You've got a couple of these. Perhaps we should add either a
> > > > model_is_bifrost() helper or an is_bifrost variable to use.
> > > >       
> > > >> +               cfg |= GPU_PERFCNT_CFG_SETSEL(1);      
> > > 
> > > mali_kbase only sets this if CONFIG_MALI_PRFCNT_SET_SECONDARY is defined
> > > - i.e. this selects a different selection of counters. At at the very
> > > least this should be an optional flag that can be set, but I suspect the
> > > primary set are the ones you are interested in.    
> > 
> > I'll add a param to pass the set of counters to monitor.
> >   
> > >     
> > > >> +
> > > >> +       gpu_write(pfdev, GPU_PERFCNT_CFG,
> > > >> +                 cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_OFF));
> > > >> +
> > > >> +       if (!pfdev->perfcnt->enabled)
> > > >> +               return;
> > > >> +
> > > >> +       gpu_write(pfdev, GPU_PRFCNT_JM_EN, 0xffffffff);
> > > >> +       gpu_write(pfdev, GPU_PRFCNT_SHADER_EN, 0xffffffff);
> > > >> +       gpu_write(pfdev, GPU_PRFCNT_MMU_L2_EN, 0xffffffff);
> > > >> +
> > > >> +       /*
> > > >> +        * Due to PRLAM-8186 we need to disable the Tiler before we enable HW
> > > >> +        * counters.      
> > > > 
> > > > Seems like you could just always apply the work-around? It doesn't
> > > > look like it would be much different.      
> > > 
> > > Technically the workaround causes the driver to perform the operation in
> > > the wrong order below (write TILER_EN when the mode is MANUAL) - this is
> > > a workaround for the hardware issue (and therefore works on that
> > > hardware). But I wouldn't like to say it works on other hardware which
> > > doesn't have the issue.    
> > 
> > Okay, I'll keep the workaround only for HW that are impacted by the bug.
> >   
> > >     
> > > >> +        */
> > > >> +       if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
> > > >> +               gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0);
> > > >> +       else
> > > >> +               gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
> > > >> +
> > > >> +       gpu_write(pfdev, GPU_PERFCNT_CFG,
> > > >> +                 cfg | GPU_PERFCNT_CFG_MODE(GPU_PERFCNT_CFG_MODE_MANUAL));
> > > >> +
> > > >> +       if (panfrost_has_hw_issue(pfdev, HW_ISSUE_8186))
> > > >> +               gpu_write(pfdev, GPU_PRFCNT_TILER_EN, 0xffffffff);
> > > >> +}    
> > 
> > [...]
> >   
> > > >> +int panfrost_perfcnt_init(struct panfrost_device *pfdev)
> > > >> +{
> > > >> +       struct panfrost_perfcnt *perfcnt;
> > > >> +       struct drm_gem_shmem_object *bo;
> > > >> +       size_t size;
> > > >> +       u32 status;
> > > >> +       int ret;
> > > >> +
> > > >> +       if (panfrost_has_hw_feature(pfdev, HW_FEATURE_V4)) {
> > > >> +               unsigned int ncoregroups;
> > > >> +
> > > >> +               ncoregroups = hweight64(pfdev->features.l2_present);
> > > >> +               size = ncoregroups * BLOCKS_PER_COREGROUP *
> > > >> +                      COUNTERS_PER_BLOCK * BYTES_PER_COUNTER;
> > > >> +       } else {
> > > >> +               unsigned int nl2c, ncores;
> > > >> +
> > > >> +               /*
> > > >> +                * TODO: define a macro to extract the number of l2 caches from
> > > >> +                * mem_features.
> > > >> +                */
> > > >> +               nl2c = ((pfdev->features.mem_features >> 8) & GENMASK(3, 0)) + 1;
> > > >> +
> > > >> +               /*
> > > >> +                * The ARM driver is grouping cores per core group and then
> > > >> +                * only using the number of cores in group 0 to calculate the
> > > >> +                * size. Not sure why this is done like that, but I guess
> > > >> +                * shader_present will only show cores in the first group
> > > >> +                * anyway.
> > > >> +                */      
> > > 
> > > I'm not sure I understand this comment. I'm not sure which version of
> > > the driver you are looking at, but shader_present should contain all the
> > > cores (in every core group).
> > > 
> > > The kbase driver (in panfrost's git tree[1]), contains:
> > > 
> > > 	base_gpu_props *props = &kbdev->gpu_props.props;
> > > 	u32 nr_l2 = props->l2_props.num_l2_slices;
> > > 	u64 core_mask = props->coherency_info.group[0].core_mask;
> > > 
> > > 	u32 nr_blocks = fls64(core_mask);
> > > 
> > > 	/* JM and tiler counter blocks are always present */
> > > 	dump_size = (2 + nr_l2 + nr_blocks) *
> > > 			NR_CNT_PER_BLOCK *
> > > 			NR_BYTES_PER_CNT;
> > > 
> > > [1]
> > > https://gitlab.freedesktop.org/panfrost/mali_kbase/blob/master/driver/product/kernel/drivers/gpu/arm/midgard/mali_kbase_vinstr.c
> > > 
> > > i.e. the number of cores + number of L2 caches (plus 2 for JM/tiler)
> > > multiplied by the block size.    
> > 
> > Actually, I was referring to what's done in
> > kbase_gpuprops_construct_coherent_groups() [2].
> >   
> > > 
> > > Also another difference with the below is that fls64 and hweight64 don't
> > > return the same numbers. If the core mask is non-contiguous this will
> > > return different values.    
> > 
> > Right, I should use fls64() not hweight64().  
> 
> Oops, forgot to ask the extra question I had. Looks like when the GPU
> is busy and I enable/dump/disable perfcnt too quickly, I sometime gets
> {JM,SHADER,MMU_L2,TILER}_COUNTER[2] = 0 while I'd expect it to be set
> to the {JM,SHADER,MMU_L2,TILER}_EN value (0xff or 0xffff depending on
> the block). I made sure I was receiving the SAMPLE_DONE+CLEAN_DONE
> events. Any idea what could cause this?
> 
> Note that when the GPU is idle, {JM,SHADER,MMU_L2,TILER}_COUNTER[2] are
> set to the value I expect (0xff or 0xffff depending on the block).

I think I found the issue. Looks like drm_gem_shmem_vmap() is mapping
the BO in cached mode and there's no helper to do cache maintenance
operation on this BO (dma_map/unmap_sg()). Mapping it writecombine
seems to do the trick (at least I no longer have those inconsistencies
on {JM,SHADER,MMU_L2,TILER}_COUNTER[2]).