[PATCH 2/2] drm/panfrost: adjusted job affinity for dual core group GPUs

Thu Jan 13 11:22:19 UTC 2022

On 13/01/2022 10:01, Alexey Sheplyakov wrote:
> Hi, Steven!
> 
> Thanks for such a detailed explanation of T628 peculiarities.
> 
> On Wed, Jan 12, 2022 at 05:03:15PM +0000, Steven Price wrote:
>> On 10/01/2022 17:42, Alyssa Rosenzweig wrote:
>>>> Whether it's worth the effort depends on whether anyone really cares
>>>> about getting the full performance out of this particular GPU.
>>>>
>>>> At this stage I think the main UABI change would be to add the opposite
>>>> flag to kbase, (e.g. "PANFROST_JD_DOESNT_NEED_COHERENCY_ON_GPU"[1]) to
>>>> opt-in to allowing the job to run across all cores.
>>>>
>>>> The second change would be to allow compute jobs to be run on the second
>>>> core group, so another flag: PANFROST_RUN_ON_SECOND_CORE_GROUP.
>>>>
>>>> But clearly there's little point adding such flags until someone steps
>>>> up to do the Mesa work.
>>>
>>> I worry about the maintainence burden (both Mesa and kernel) of adding
>>> UABI only used by a piece of hardware none of us own, and only useful
>>> "sometimes" for that hardware. Doubly so for the second core group
>>> support; currently Mesa doesn't advertise any compute support on
>>> anything older than Mali T760 ... to the best of my knowledge, nobody
>>> has missed that support either...
>>
>> I agree there's no point adding the UABI support unless someone is
>> willing to step and be a maintainer for that hardware. And I suspect no
>> one cares enough about that hardware to do that.
>>
>>> To be clear I am in favour of merging the patches needed for GLES2 to
>>> work on all Malis, possibly at a performance cost on these dual-core
>>> systems. That's a far cry from the level of support the DDK gave these
>>> chips back in the day ... of course, the DDK doesn't support them at all
>>> anymore, so Panfrost wins there by default! ;)
>>>
>>
>> Agreed - I'm happy to merge a kernel series similar to this. I think the
>> remaining problems are:
>>
>> 1. Addressing Robin's concerns about the first patch. That looks like
>> it's probably just wrong.
> 
> The first patch is wrong and I'll drop it.
> 
>> 2. I think this patch is too complex for the basic support. There's some
>> parts like checking GROUPS_L2_COHERENT which also don't feature in kbase
> 
> That part has been adapted from kbase_gpuprops_construct_coherent_groups, see
> https://github.com/hardkernel/linux/blob/2f0f4268209ddacc2cdea158104b87cedacbd0e3/drivers/gpu/arm/midgard/mali_kbase_gpuprops.c#L94

Gah! I was looking at the wrong version of kbase... ;)

>> so I don't believe are correct.
> 
> I'm not sure if it's correct or not, however
> - it does not change anything for GPUs with coherent L2 caches
> - it appears to correctly figure out core groups for several SoCs
>   with T628 GPU (BE-M1000, Exynos 5422).

Yeah, sorry about that - you are right this matches the (earlier) kbase
versions. I don't believe there ever was a GPU released without
GROUPS_L2_COHERENT set (which means that shader cores are coherent
within an L2, *NOT* that the L2 is coherent).

I think I was having an off day yesterday... ;) Ignore this - it's
arguably cargo-culting from kbase, but there are advantages to that.

>> 3. I don't think this blocks the change. But if we're not using the
>> second core group we could actually power it down. Indeed simply not
>> turning on the L2/shader cores should in theory work (jobs are not
>> scheduled to cores which are turned off even if they are included in the
>> affinity).
> 
> Powering off unused GPU is would be nice, however
> 
> 1) the userspace might power on those cores again (via sysfs or something),
>    so I prefer to explicitly schedule jobs to the core group 0.

We don't expose that level of control to user space. Currently panfrost
either powers on all cores or none.

> 2) on BE-M1000 GPU seems to lock up in a few seconds after powering off
>    some (GPU) cores. In fact I had to disable GPU devfreq to prevent GPU
>    lockups.

I'm not sure how you tested powering off some GPU cores - I'm surprised
that that causes problems. Having to disable GPU devfreq points to an
issue with the DVFS performance points in your DT (perhaps the voltages
are wrong?) or potentially a bug in the driving of the clocks/regulators.

> Therefore I consider powering off unused cores as a later optimization. 
> (frankly speaking I'd better put the effort to *making use* of those cores
> instead of figuring out why they fail to power down properly).

Making use is much more work - it depends if you(/anyone) cares about
power consumption on these devices. Although whether the hardware
actually implements any meaningful power gating for the second core
group is another matter so perhaps it wouldn't make any difference anyway.

My thought was something along the lines of the below which just turns
the cores off and otherwise doesn't require playing with affinities. If
we actually want to use the second core group then much more thought
will have to go into how the job slots are used.

---8<---

diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index bbe628b306ee..2e9f9d1ee830 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -320,19 +320,34 @@ void panfrost_gpu_power_on(struct panfrost_device *pfdev)
 {
 	int ret;
 	u32 val;
+	u32 core_mask = U32_MAX;
 
 	panfrost_gpu_init_quirks(pfdev);
 
-	/* Just turn on everything for now */
-	gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present);
+	if (pfdev->features.l2_present != 1) {
+		/*
+		 * Only support one core group for now.
+		 * ~(l2_present - 1) unsets all bits in l2_present except the
+		 * bottom bit. (l2_present - 2) has all the bits in the first
+		 * core group set. AND them together to generate a mask of
+		 * cores in the first core group.
+		 */
+		core_mask = ~(pfdev->features.l2_present - 1) &
+			     (pfdev->features.l2_present - 2);
+	}
+
+	gpu_write(pfdev, L2_PWRON_LO, pfdev->features.l2_present & core_mask);
 	ret = readl_relaxed_poll_timeout(pfdev->iomem + L2_READY_LO,
-		val, val == pfdev->features.l2_present, 100, 20000);
+		val, val == (pfdev->features.l2_present & core_mask),
+		100, 20000);
 	if (ret)
 		dev_err(pfdev->dev, "error powering up gpu L2");
 
-	gpu_write(pfdev, SHADER_PWRON_LO, pfdev->features.shader_present);
+	gpu_write(pfdev, SHADER_PWRON_LO,
+		  pfdev->features.shader_present & core_mask);
 	ret = readl_relaxed_poll_timeout(pfdev->iomem + SHADER_READY_LO,
-		val, val == pfdev->features.shader_present, 100, 20000);
+		val, val == (pfdev->features.shader_present & core_mask),
+		100, 20000);
 	if (ret)
 		dev_err(pfdev->dev, "error powering up gpu shader");
 
---8<---
I don't have a dual-core group system to hand so this is mostly untested.

Thanks,

Steve