[PATCH v2] drm/msm/dpu: make "vblank timeout" more useful

Tue Feb 20 23:04:25 UTC 2024

On 2/20/2024 2:42 PM, Dmitry Baryshkov wrote:
> On Wed, 21 Feb 2024 at 00:40, Abhinav Kumar <quic_abhinavk at quicinc.com> wrote:
>>
>>
>>
>> On 2/19/2024 3:52 AM, Dmitry Baryshkov wrote:
>>> On Wed, 14 Feb 2024 at 22:36, Abhinav Kumar <quic_abhinavk at quicinc.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2/14/2024 11:20 AM, Dmitry Baryshkov wrote:
>>>>> On Wed, 14 Feb 2024 at 20:02, Abhinav Kumar <quic_abhinavk at quicinc.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/8/2024 6:50 AM, Dmitry Baryshkov wrote:
>>>>>>> We have several reports of vblank timeout messages. However after some
>>>>>>> debugging it was found that there might be different causes to that.
>>>>>>> To allow us to identify the DPU block that gets stuck, include the
>>>>>>> actual CTL_FLUSH value into the timeout message and trigger the devcore
>>>>>>> snapshot capture.
>>>>>>>
>>>>>>> Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov at linaro.org>
>>>>>>> ---
>>>>>>> Changes in v2:
>>>>>>> - Added a call to msm_disp_snapshot_state() to trigger devcore dump
>>>>>>>       (Abhinav)
>>>>>>> - Link to v1: https://lore.kernel.org/r/20240106-fd-dpu-debug-timeout-v1-1-6d9762884641@linaro.org
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c | 3 ++-
>>>>>>>      1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
>>>>>>> index d0f56c5c4cce..a8d6165b3c0a 100644
>>>>>>> --- a/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
>>>>>>> +++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_encoder_phys_vid.c
>>>>>>> @@ -489,7 +489,8 @@ static int dpu_encoder_phys_vid_wait_for_commit_done(
>>>>>>>                  (hw_ctl->ops.get_flush_register(hw_ctl) == 0),
>>>>>>>                  msecs_to_jiffies(50));
>>>>>>>          if (ret <= 0) {
>>>>>>> -             DPU_ERROR("vblank timeout\n");
>>>>>>> +             DPU_ERROR("vblank timeout: %x\n", hw_ctl->ops.get_flush_register(hw_ctl));
>>>>>>> +             msm_disp_snapshot_state(phys_enc->parent->dev);
>>>>>>
>>>>>>
>>>>>> There is no rate limiting in this piece of code unfortunately. So this
>>>>>> will flood the number of snapshots.
>>>>>
>>>>> Well... Yes and no. The devcoredump will destroy other snapshots if
>>>>> there is a pending one. So only the console will be flooded and only
>>>>> in case when MSM_DISP_SNAPSHOT_DUMP_IN_CONSOLE is enabled.
>>>>>
>>>>
>>>> Yes, true but at the same time this makes it hard to capture a good dump
>>>> as potentially every vblank you could timeout so this destroy/create
>>>> cycle wont end.
>>>
>>> Excuse me, maybe I miss something. On the first timeout the snapshot
>>> is created. It is held by the kernel until it is fully read out from
>>> the userspace. Other snapshots will not interfere with this snapshot.
>>>
>>
>> For every new snapshot a new devcoredump device will be created which
>> should remain till it has been read. But now this will be created every
>> blank. IMO, this is really too much data for no reason.
> 
> No-no-no. If there is a devcoredump for a device, the next one will
> not be created. See dev_coredumpm().
> So all the snapshots will be created and then destroyed immediately.
> 

hmm ... I have certainly seen devcd_count go higher than one (but not 
more than 2). I am wondering whether this was because of some race 
condition of the previous destroy / new create.

But anyway, this part is clear now. thanks.

>>
>> Subsequent vblank timeouts are not going to give any new information
>> compared to the existing snapshot of the first vblank timeout thats why
>> we should just create the snapshot when the first error happens and stop.
>>
>> For other frame done timeouts, infact subsequent timeouts without any
>> sort of recovery in between are quite misleading because hardware was
>> already not able to fetch the previous frame so it will most likely not
>> fetch the next one either till it has recovered. Typically thats why
>> these vblank timeouts happen in a flurry as the hardware never really
>> recovered from the first timeout.
>>
>>> Or are you worried that snapshotting takes time, so taking a snapshot
>>> will also interfere with the vblank timings for the next vblank?
>>>
>>
>> Yes this is another point.
> 

snapshots will still be captured every vblank timeout and reading 
through the entire DPU reg space every vblank timeout is certainly 
something we can avoid.