[PATCH] drm/panfrost: Really power off GPU cores in panfrost_gpu_power_off()

Tue Nov 21 17:08:44 UTC 2023

On 21/11/2023 17:55, Boris Brezillon wrote:
> On Tue, 21 Nov 2023 17:11:42 +0100
> AngeloGioacchino Del Regno <angelogioacchino.delregno at collabora.com>
> wrote:
> 
>> Il 21/11/23 16:34, Krzysztof Kozlowski ha scritto:
>>> On 08/11/2023 14:20, Steven Price wrote:  
>>>> On 02/11/2023 14:15, AngeloGioacchino Del Regno wrote:  
>>>>> The layout of the registers {TILER,SHADER,L2}_PWROFF_LO, used to request
>>>>> powering off cores, is the same as the {TILER,SHADER,L2}_PWRON_LO ones:
>>>>> this means that in order to request poweroff of cores, we are supposed
>>>>> to write a bitmask of cores that should be powered off!
>>>>> This means that the panfrost_gpu_power_off() function has always been
>>>>> doing nothing.
>>>>>
>>>>> Fix powering off the GPU by writing a bitmask of the cores to poweroff
>>>>> to the relevant PWROFF_LO registers and then check that the transition
>>>>> (from ON to OFF) has finished by polling the relevant PWRTRANS_LO
>>>>> registers.
>>>>>
>>>>> While at it, in order to avoid code duplication, move the core mask
>>>>> logic from panfrost_gpu_power_on() to a new panfrost_get_core_mask()
>>>>> function, used in both poweron and poweroff.
>>>>>
>>>>> Fixes: f3ba91228e8e ("drm/panfrost: Add initial panfrost driver")
>>>>> Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno at collabora.com>  
>>>
>>>
>>> Hi,
>>>
>>> This commit was added to next recently but it causes "external abort on
>>> non-linefetch" during boot of my Odroid HC1 board.
>>>
>>> At least bisect points to it.
>>>
>>> If fixed, please add:
>>>
>>> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski at linaro.org>
>>>
>>> [    4.861683] 8<--- cut here ---
>>> [    4.863429] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0c8802c
>>> [    4.871018] [f0c8802c] *pgd=433ed811, *pte=11800653, *ppte=11800453
>>> ...
>>> [    5.164010]  panfrost_gpu_irq_handler from __handle_irq_event_percpu+0xcc/0x31c
>>> [    5.171276]  __handle_irq_event_percpu from handle_irq_event+0x38/0x80
>>> [    5.177765]  handle_irq_event from handle_fasteoi_irq+0x9c/0x250
>>> [    5.183743]  handle_fasteoi_irq from generic_handle_domain_irq+0x28/0x38
>>> [    5.190417]  generic_handle_domain_irq from gic_handle_irq+0x88/0xa8
>>> [    5.196741]  gic_handle_irq from generic_handle_arch_irq+0x34/0x44
>>> [    5.202893]  generic_handle_arch_irq from __irq_svc+0x8c/0xd0
>>>
>>> Full log:
>>> https://krzk.eu/#/builders/21/builds/4392/steps/11/logs/serial0
>>>   
>>
>> Hey Krzysztof,
>>
>> This is interesting. It might be about the cores that are missing from the partial
>> core_mask raising interrupts, but an external abort on non-linefetch is strange to
>> see here.
> 
> I've seen such external aborts in the past, and the fault type has
> often been misleading. It's unlikely to have anything to do with a

Yeah, often accessing device with power or clocks gated.

> non-linefetch access, but it might be caused by a register access after
> the clock or power domain driving the register bank has been disabled.
> The following diff might help validate this theory. If that works, we
> probably want to make sure we synchronize IRQs before disabling in the
> suspend path.
> 
> --->8---
> diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
> index 55ec807550b3..98df66e5cc9b 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_regs.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
> @@ -34,8 +34,6 @@
>           (GPU_IRQ_FAULT                        |\
>            GPU_IRQ_MULTIPLE_FAULT               |\
>            GPU_IRQ_RESET_COMPLETED              |\
> -          GPU_IRQ_POWER_CHANGED                |\
> -          GPU_IRQ_POWER_CHANGED_ALL            |\

This helped, at least for this issue (next-20231121). Much later in
user-space boot I have lockups:
watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [kworker/4:1:61]

[   56.329224]  smp_call_function_single from
__sync_rcu_exp_select_node_cpus+0x29c/0x78c
[   56.337111]  __sync_rcu_exp_select_node_cpus from
sync_rcu_exp_select_cpus+0x334/0x878
[   56.344995]  sync_rcu_exp_select_cpus from wait_rcu_exp_gp+0xc/0x18
[   56.351231]  wait_rcu_exp_gp from process_one_work+0x20c/0x620
[   56.357038]  process_one_work from worker_thread+0x1d0/0x488
[   56.362668]  worker_thread from kthread+0x104/0x138
[   56.367521]  kthread from ret_from_fork+0x14/0x28

But anyway the external abort does not appear.

Best regards,
Krzysztof