[PATCH 5/5] drm/msm/dpu: rate limit snapshot capture for mmu faults

Tue Jul 16 22:43:06 UTC 2024

On 7/16/2024 2:50 PM, Rob Clark wrote:
> On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk at quicinc.com> wrote:
>>
>>
>>
>> On 7/15/2024 12:51 PM, Rob Clark wrote:
>>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
>>> <dmitry.baryshkov at linaro.org> wrote:
>>>>
>>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
>>>>> There is no recovery mechanism in place yet to recover from mmu
>>>>> faults for DPU. We can only prevent the faults by making sure there
>>>>> is no misconfiguration.
>>>>>
>>>>> Rate-limit the snapshot capture for mmu faults to once per
>>>>> msm_kms_init_aspace(), as that should be sufficient to capture
>>>>> the snapshot for debugging. Otherwise there will be a lot of
>>>>> dpu snapshots getting captured for the same fault, which is
>>>>> redundant and might also affect capturing even one snapshot
>>>>> accurately.
>>>>
>>>> Please squash this into the first patch. There is no need to add code
>>>> with a known deficiency.
>>>>
>>>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
>>>
>>> So, in some ways devcoredump is ratelimited by userspace needing to
>>> clear an existing devcore..
>>>
>>
>> Yes, a new devcoredump device will not be created until the previous one
>> is consumed or times out but here I am trying to limit even the DPU
>> snapshot capture because DPU register space is really huge and the rate
>> at which smmu faults occur is so fast that it's causing instability
>> while snapshots are being captured.
>>
>>> What I'd suggest would be more useful is to limit the devcores to once
>>> per atomic update, ie. if display state hasn't changed, maybe an
>>> additional devcore isn't useful
>>>
>>> BR,
>>> -R
>>>
>>
>> By display state change, do you mean like the checks we have in
>> drm_atomic_crtc_needs_modeset()?
>>
>> OR do you mean we need to cache the previous (currently picked up by hw)
>> state and trigger a new devcore if the new state is different by
>> comparing more things?
>>
>> This will help to reduce the snapshots to unique frame updates but I do
>> not think it will reduce the rate enough for the case where DPU did not
>> recover from the previous fault.
> 
> I was thinking the easy thing, of just resetting the counter in
> msm_atomic_commit_tail().. I suppose we could be clever and filter out
> updates that only change scanout address.  Or hash the atomic state
> and only generate devcoredumps for unique states.  But I'm not sure
> how over-complicated we should make this.
> 
> BR,
> -R

It's a good idea actually and I would also like to keep it simple :)

One question, is it okay to assume that all compositors will only issue 
unique commits? Because we are assuming that's the case with resetting 
the counter in msm_atomic_commit_tail().
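
To make the question concrete, here is a minimal userspace sketch of the
gating logic, not actual driver code. struct fake_kms, fake_fault_handler()
and fake_commit_tail() are hypothetical stand-ins for struct msm_kms,
msm_kms_fault_handler() and the proposed reset in msm_atomic_commit_tail();
the snapshots_taken counter only exists so the behaviour can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct msm_kms; only what the sketch needs. */
struct fake_kms {
	int fault_snapshot_capture;	/* non-zero once a snapshot was taken */
	int snapshots_taken;		/* counts simulated snapshot captures */
};

/*
 * Models the fault handler from the patch: capture a snapshot only
 * while the gate is open, then close the gate.
 * Returns true if a snapshot was captured for this fault.
 */
static bool fake_fault_handler(struct fake_kms *kms)
{
	assert(kms != NULL);

	if (kms->fault_snapshot_capture)
		return false;

	kms->snapshots_taken++;	/* stands in for msm_disp_snapshot_state() */
	kms->fault_snapshot_capture++;
	return true;
}

/* Models the suggested reset of the counter on every atomic commit. */
static void fake_commit_tail(struct fake_kms *kms)
{
	assert(kms != NULL);

	kms->fault_snapshot_capture = 0;
}
```

With this model, any number of faults between two commits yields a single
snapshot, but every commit re-arms the capture. So if a compositor keeps
issuing identical commits while the DPU is still faulting, we would still
get one snapshot per commit.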

> 
>>
>>>>
>>>>>
>>>>> Signed-off-by: Abhinav Kumar <quic_abhinavk at quicinc.com>
>>>>> ---
>>>>>    drivers/gpu/drm/msm/msm_kms.c | 6 +++++-
>>>>>    drivers/gpu/drm/msm/msm_kms.h | 3 +++
>>>>>    2 files changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
>>>>> index d5d3117259cf..90a333920c01 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_kms.c
>>>>> +++ b/drivers/gpu/drm/msm/msm_kms.c
>>>>> @@ -168,7 +168,10 @@ static int msm_kms_fault_handler(void *arg, unsigned long iova, int flags, void
>>>>>    {
>>>>>         struct msm_kms *kms = arg;
>>>>>
>>>>> -     msm_disp_snapshot_state(kms->dev);
>>>>> +     if (!kms->fault_snapshot_capture) {
>>>>> +             msm_disp_snapshot_state(kms->dev);
>>>>> +             kms->fault_snapshot_capture++;
>>>>
>>>> When is it decremented?
>>>>
>>>>> +     }
>>>>>
>>>>>         return -ENOSYS;
>>>>>    }
>>>>> @@ -208,6 +211,7 @@ struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev)
>>>>>                 mmu->funcs->destroy(mmu);
>>>>>         }
>>>>>
>>>>> +     kms->fault_snapshot_capture = 0;
>>>>>         msm_mmu_set_fault_handler(aspace->mmu, kms, msm_kms_fault_handler);
>>>>>
>>>>>         return aspace;
>>>>> diff --git a/drivers/gpu/drm/msm/msm_kms.h b/drivers/gpu/drm/msm/msm_kms.h
>>>>> index 1e0c54de3716..240b39e60828 100644
>>>>> --- a/drivers/gpu/drm/msm/msm_kms.h
>>>>> +++ b/drivers/gpu/drm/msm/msm_kms.h
>>>>> @@ -134,6 +134,9 @@ struct msm_kms {
>>>>>         int irq;
>>>>>         bool irq_requested;
>>>>>
>>>>> +     /* rate limit the snapshot capture to once per attach */
>>>>> +     int fault_snapshot_capture;
>>>>> +
>>>>>         /* mapper-id used to request GEM buffer mapped for scanout: */
>>>>>         struct msm_gem_address_space *aspace;
>>>>>
>>>>> --
>>>>> 2.44.0
>>>>>
>>>>
>>>> --
>>>> With best wishes
>>>> Dmitry