[PATCH] drm/msm: Wait for MMU devcoredump when waiting for GMU

Akhil P Oommen akhilpo at oss.qualcomm.com
Fri Jul 25 22:12:18 UTC 2025


On 7/25/2025 1:31 AM, Connor Abbott wrote:
> On Thu, Jul 24, 2025 at 3:48 PM Akhil P Oommen <akhilpo at oss.qualcomm.com> wrote:
>>
>> On 7/21/2025 9:02 PM, Rob Clark wrote:
>>> On Fri, Jul 18, 2025 at 6:50 AM Connor Abbott <cwabbott0 at gmail.com> wrote:
>>>>
>>>> If there is a flood of faults then the MMU can become saturated while
>>>> it waits for the kernel to process the first fault and resume it,
>>>> leaving the GMU blocked. This is mainly a problem when the kernel
>>>> reads the state of the GPU for a devcoredump, because that takes a
>>>> while. If we time out waiting for the GMU, check whether this has
>>>> happened and retry after the dump is finished.
>>>>
>>>> Signed-off-by: Connor Abbott <cwabbott0 at gmail.com>
>>>> ---
>>>>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c   | 21 ++++++++++++++++++---
>>>>  drivers/gpu/drm/msm/adreno/a6xx_hfi.c   | 21 ++++++++++++++++++---
>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 11 +++++++++++
>>>>  drivers/gpu/drm/msm/adreno/adreno_gpu.h |  2 ++
>>>>  4 files changed, 49 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> index 28e6705c6da682c7b41c748e375dda59a6551898..6ec396fab22d194481a76d30b2d36ea5fb662241 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
>>>> @@ -340,6 +340,7 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>>>         int ret;
>>>>         u32 val;
>>>>         int request, ack;
>>>> +       struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
>>>>
>>>>         WARN_ON_ONCE(!mutex_is_locked(&gmu->lock));
>>>>
>>>> @@ -363,9 +364,23 @@ int a6xx_gmu_set_oob(struct a6xx_gmu *gmu, enum a6xx_gmu_oob_state state)
>>>>         /* Trigger the requested OOB operation */
>>>>         gmu_write(gmu, REG_A6XX_GMU_HOST2GMU_INTR_SET, 1 << request);
>>>>
>>>> -       /* Wait for the acknowledge interrupt */
>>>> -       ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>>>> -               val & (1 << ack), 100, 10000);
>>>> +       do {
>>>> +               /* Wait for the acknowledge interrupt */
>>>> +               ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_GMU2HOST_INTR_INFO, val,
>>>> +                       val & (1 << ack), 100, 10000);
>>>> +
>>>> +               if (!ret)
>>>> +                       break;
>>>> +
>>>> +               if (completion_done(&a6xx_gpu->base.fault_coredump_done))
>>
>> I didn't get why this is required. Could you please add a comment?
> 
> Without this, if the GMU timed out for some other reason unrelated to
> the SMMU, we'd loop infinitely. This gives up if there isn't currently
> a crashstate pending.

Ah! That API doc somehow confused me.

> 
>>
>>>> +                       break;
>>>> +
>>>> +               /* We may time out because the GMU is temporarily wedged by
>>>> +                * pending faults from the GPU while we are taking a devcoredump.
>>>> +                * Wait until the MMU is resumed and try again.
>>>> +                */
>>>> +               wait_for_completion(&a6xx_gpu->base.fault_coredump_done);

Use the interruptible version? We may reach here from a process context.
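
Something like this, perhaps (just an untested sketch; the
-ERESTARTSYS would then surface through the same error path as a
timeout, which the callers of a6xx_gmu_set_oob() would need to
tolerate):

		ret = wait_for_completion_interruptible(
				&a6xx_gpu->base.fault_coredump_done);
		if (ret)
			break;	/* interrupted by a signal, bail out */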

>>>> +       } while (true);
>>>
>>> It is a bit sad to duplicate this nearly identical code twice.  And I
>>> wonder if other gmu_poll_timeout()'s need similar treatment?  Maybe
>>> Akhil has an opinion about whether we should just build this into
>>> gmu_poll_timeout() instead?
>>
>> Yeah, that makes sense. A potential issue I see is that we might be
>> holding both the gpu and gmu locks here, and the crashstate capture in
>> the pf handler tries to take the gpu lock, which can result in a deadlock.
> 
> I think there would already be a deadlock, or at least a timeout, in
> that situation now. Any task waiting for the GMU to complete while
> holding the GPU lock would block the crashstate capture from completing
> and allowing the GMU to continue.

A timeout is fine since there is eventual progress, but a deadlock is
not acceptable. Also, userspace can easily trigger this deadlock, which
makes it a security issue.
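
If we do decide to fold the retry into the poll helper, I imagine
something roughly like this (a hypothetical sketch mirroring the loop
in this patch, and ignoring the locking concern above):

#define gmu_poll_timeout_fault(_gmu, addr, val, cond, min, max)		\
({									\
	struct a6xx_gpu *_gpu = container_of(_gmu, struct a6xx_gpu, gmu); \
	int _ret;							\
	do {								\
		_ret = gmu_poll_timeout(_gmu, addr, val, cond,		\
					min, max);			\
		if (!_ret ||						\
		    completion_done(&_gpu->base.fault_coredump_done))	\
			break;						\
		/* a coredump is in flight; wait for it to finish */	\
		wait_for_completion(&_gpu->base.fault_coredump_done);	\
	} while (1);							\
	_ret;								\
})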

I agree, we need to improve the GMU error handling situation overall. I
actually thought about this a few years ago; at the time, I thought it
would be simpler if we always did coredump/recovery from a single
thread. Not sure if that idea still makes sense.
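
For reference, the rough shape I had in mind back then (all names
hypothetical):

/*
 * Hypothetical sketch: funnel every coredump/recovery through a single
 * ordered workqueue, so captures are serialized and never run inline
 * from a path that already holds the gpu/gmu locks.
 */
static struct workqueue_struct *recovery_wq;

static void msm_gpu_recovery_worker(struct work_struct *work)
{
	struct msm_gpu *gpu = container_of(work, struct msm_gpu,
					   recover_work);

	mutex_lock(&gpu->lock);
	/* capture the crashstate, then reset and resume the GPU */
	mutex_unlock(&gpu->lock);
}

/* at probe time: */
recovery_wq = alloc_ordered_workqueue("msm_gpu_recovery", 0);

/* fault/hang paths only queue the work, never capture inline: */
queue_work(recovery_wq, &gpu->recover_work);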

On a related topic, stall-on-fault cannot be used in production. The
GMU is critical because it interacts directly with the SoC power
management subsystems, and every year it takes on additional
time-critical mitigation duties like CLX, thermal, BCL, etc. These
mitigations must be handled within a few microseconds, so the GMU
should never be blocked, even for that long. Apart from that, even the
GPU's internal bus can lock up in rare cases, which can lead to a fatal
system bus/NoC error when the KMD accesses a register in the GPU
address space.

But stall-on-fault is useful while debugging, so any improvement in
this area is welcome.

-Akhil.


