[PATCH 5/5] drm/msm/dpu: rate limit snapshot capture for mmu faults

Tue Jul 16 23:14:56 UTC 2024

On 7/16/2024 4:10 PM, Dmitry Baryshkov wrote:
> On Wed, 17 Jul 2024 at 01:43, Abhinav Kumar <quic_abhinavk at quicinc.com> wrote:
>>
>>
>>
>> On 7/16/2024 2:50 PM, Rob Clark wrote:
>>> On Tue, Jul 16, 2024 at 2:45 PM Abhinav Kumar <quic_abhinavk at quicinc.com> wrote:
>>>>
>>>>
>>>>
>>>> On 7/15/2024 12:51 PM, Rob Clark wrote:
>>>>> On Mon, Jul 1, 2024 at 12:43 PM Dmitry Baryshkov
>>>>> <dmitry.baryshkov at linaro.org> wrote:
>>>>>>
>>>>>> On Fri, Jun 28, 2024 at 02:48:47PM GMT, Abhinav Kumar wrote:
>>>>>>> There is no recovery mechanism in place yet to recover from mmu
>>>>>>> faults for DPU. We can only prevent the faults by making sure there
>>>>>>> is no misconfiguration.
>>>>>>>
>>>>>>> Rate-limit the snapshot capture for mmu faults to once per
>>>>>>> msm_kms_init_aspace() as that should be sufficient to capture
>>>>>>> the snapshot for debugging otherwise there will be a lot of
>>>>>>> dpu snapshots getting captured for the same fault which is
>>>>>>> redundant and also might affect capturing even one snapshot
>>>>>>> accurately.
>>>>>>
>>>>>> Please squash this into the first patch. There is no need to add code
>>>>>> with a known defficiency.
>>>>>>
>>>>>> Also, is there a reason why you haven't used <linux/ratelimit.h> ?
>>>>>
>>>>> So, in some ways devcoredump is ratelimited by userspace needing to
>>>>> clear an existing devcore..
>>>>>
>>>>
>>>> Yes, a new devcoredump device will not be created until the previous one
>>>> is consumed or times out but here I am trying to limit even the DPU
>>>> snapshot capture because DPU register space is really huge and the rate
>>>> at which smmu faults occur is quite fast that its causing instability
>>>> while snapshots are being captured.
>>>>
>>>>> What I'd suggest would be more useful is to limit the devcores to once
>>>>> per atomic update, ie. if display state hasn't changed, maybe an
>>>>> additional devcore isn't useful
>>>>>
>>>>> BR,
>>>>> -R
>>>>>
>>>>
>>>> By display state change, do you mean like the checks we have in
>>>> drm_atomic_crtc_needs_modeset()?
>>>>
>>>> OR do you mean we need to cache the previous (currently picked up by hw)
>>>> state and trigger a new devcores if the new state is different by
>>>> comparing more things?
>>>>
>>>> This will help to reduce the snapshots to unique frame updates but I do
>>>> not think it will reduce the rate enough for the case where DPU did not
>>>> recover from the previous fault.
>>>
>>> I was thinking the easy thing, of just resetting the counter in
>>> msm_atomic_commit_tail().. I suppose we could be clever filter out
>>> updates that only change scanout address.  Or hash the atomic state
>>> and only generate devcoredumps for unique states.  But I'm not sure
>>> how over-complicated we should make this.
>>>
>>> BR,
>>> -R
>>
>> Its a good idea actually and I would also like to keep it simple :)
>>
>> One question, is it okay to assume that all compositors will only issue
>> unique commits? Because we are assuming thats the case with resetting
>> the counter in msm_atomic_commit_tail().
> 
> The compositors use drm_mode_atomic_ioctl() which allocates a new
> state each time. It is impossible to re-submit the same
> drm_atomic_state from userspace.
> 

No, what I meant was, is it okay to assume that a commit is issued only 
when display configuration has changed?

Like if we get multiple commits for the same frame for some reason, 
thats also spam and this approach will not avoid that.

>