[PATCH v3] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

Christian König christian.koenig at amd.com
Tue Jul 11 09:59:40 UTC 2023


Am 11.07.23 um 11:31 schrieb Chen, Guchun:
> [Public]
>
>> -----Original Message-----
>> From: Christian König<ckoenig.leichtzumerken at gmail.com>
>> Sent: Tuesday, July 11, 2023 5:23 PM
>> To: Chen, Guchun<Guchun.Chen at amd.com>; amd-
>> gfx at lists.freedesktop.org; Deucher, Alexander
>> <Alexander.Deucher at amd.com>; Zhang, Hawking
>> <Hawking.Zhang at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Milinkovic, Dusica
>> <Dusica.Milinkovic at amd.com>; Prica, Nikola<Nikola.Prica at amd.com>; Cui,
>> Flora<Flora.Cui at amd.com>
>> Cc:stable at vger.kernel.org
>> Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
>> hrtimer_try_to_cancel
>>
>> Am 11.07.23 um 11:15 schrieb Chen, Guchun:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Christian König<ckoenig.leichtzumerken at gmail.com>
>>>> Sent: Tuesday, July 11, 2023 5:09 PM
>>>> To: Chen, Guchun<Guchun.Chen at amd.com>; amd-
>>>> gfx at lists.freedesktop.org; Deucher, Alexander
>>>> <Alexander.Deucher at amd.com>; Zhang, Hawking
>> <Hawking.Zhang at amd.com>;
>>>> Koenig, Christian<Christian.Koenig at amd.com>; Milinkovic, Dusica
>>>> <Dusica.Milinkovic at amd.com>; Prica, Nikola<Nikola.Prica at amd.com>;
>>>> Cui, Flora<Flora.Cui at amd.com>
>>>> Cc:stable at vger.kernel.org
>>>> Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
>>>> hrtimer_try_to_cancel
>>>>
>>>>
>>>>
>>>> Am 11.07.23 um 03:38 schrieb Guchun Chen:
>>>>> In the screen rotation loop test below, repeated thousands of times
>>>>> with virtual display enabled, a CPU hard lockup may happen, leaving
>>>>> the system unresponsive and eventually crashing it.
>>>>>
>>>>> do {
>>>>>       xrandr --output Virtual --rotate inverted
>>>>>       xrandr --output Virtual --rotate right
>>>>>       xrandr --output Virtual --rotate left
>>>>>       xrandr --output Virtual --rotate normal
>>>>> } while (1);
>>>>>
>>>>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
>>>>>
>>>>> ? hrtimer_run_softirq+0x140/0x140
>>>>> ? store_vblank+0xe0/0xe0 [drm]
>>>>> hrtimer_cancel+0x15/0x30
>>>>> amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
>>>>> drm_vblank_disable_and_save+0x185/0x1f0 [drm]
>>>>> drm_crtc_vblank_off+0x159/0x4c0 [drm]
>>>>> ? record_print_text.cold+0x11/0x11
>>>>> ? wait_for_completion_timeout+0x232/0x280
>>>>> ? drm_crtc_wait_one_vblank+0x40/0x40 [drm]
>>>>> ? bit_wait_io_timeout+0xe0/0xe0
>>>>> ? wait_for_completion_interruptible+0x1d7/0x320
>>>>> ? mutex_unlock+0x81/0xd0
>>>>> amdgpu_vkms_crtc_atomic_disable
>>>>>
>>>>> It's caused by a lock dependency deadlock in the following scenario
>>>>> across two CPUs.
>>>>>
>>>>> CPU1                                    CPU2
>>>>> drm_crtc_vblank_off                     hrtimer_interrupt
>>>>>   grab event_lock (irq disabled)          __hrtimer_run_queues
>>>>>     grab vbl_lock/vblank_time_lock          amdgpu_vkms_vblank_simulate
>>>>>       amdgpu_vkms_disable_vblank              drm_handle_vblank
>>>>>         hrtimer_cancel                          grab dev->event_lock
>>>>>
>>>>> So CPU1 gets stuck in hrtimer_cancel, as the timer callback appears
>>>>> to run endlessly on the current clock base: the timer queue on CPU2
>>>>> never gets a chance to finish the callback because it fails to take
>>>>> dev->event_lock. The NMI watchdog then throws errors once its
>>>>> threshold is exceeded, and all later CPUs are impacted/blocked.
>>>>> Use hrtimer_try_to_cancel to fix this, as the disable_vblank
>>>>> callback does not need to wait for the handler to finish. It's also
>>>>> not necessary to check the return value of hrtimer_try_to_cancel:
>>>>> even if it returns -1, meaning the timer callback is currently
>>>>> running, the timer will be reprogrammed by hrtimer_start when
>>>>> enable_vblank is called again.
>>>>> v2: only re-arm timer when vblank is enabled (Christian) and add a
>>>>> Fixes tag as well
>>>>>
>>>>> v3: drop warn printing (Christian)
>>>>>
>>>>> Fixes: 84ec374bd580 ("drm/amdgpu: create amdgpu_vkms (v4)")
>>>>> Cc:stable at vger.kernel.org
>>>>> Suggested-by: Christian König<christian.koenig at amd.com>
>>>>> Signed-off-by: Guchun Chen<guchun.chen at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 13 ++++++++++---
>>>>>     1 file changed, 10 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>>>> index 53ff91fc6cf6..b870c827cbaa 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>>>> @@ -46,7 +46,10 @@ static enum hrtimer_restart
>>>> amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
>>>>>       struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct
>>>> amdgpu_crtc, vblank_timer);
>>>>>       struct drm_crtc *crtc = &amdgpu_crtc->base;
>>>>>       struct amdgpu_vkms_output *output =
>>>>> drm_crtc_to_amdgpu_vkms_output(crtc);
>>>>> +   struct drm_vblank_crtc *vblank;
>>>>> +   struct drm_device *dev;
>>>>>       u64 ret_overrun;
>>>>> +   unsigned int pipe;
>>>>>       bool ret;
>>>>>
>>>>>       ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
>>>>> @@ -54,9 +57,13 @@ static enum hrtimer_restart
>>>> amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
>>>>>       if (ret_overrun != 1)
>>>>>               DRM_WARN("%s: vblank timer overrun\n", __func__);
>>>>>
>>>>> +   dev = crtc->dev;
>>>>> +   pipe = drm_crtc_index(crtc);
>>>>> +   vblank = &dev->vblank[pipe];
>>>>>       ret = drm_crtc_handle_vblank(crtc);
>>>>> -   if (!ret)
>>>>> -           DRM_ERROR("amdgpu_vkms failure on handling vblank");
>>>>> +   /* Don't queue timer again when vblank is disabled. */
>>>>> +   if (!ret && !READ_ONCE(vblank->enabled))
>>>>> +           return HRTIMER_NORESTART;
>>>> Since drm_crtc_handle_vblank() returns false when vblank is disabled,
>>>> I think we can simplify this to just removing the error message.
>>>>
>>>> Regards,
>>>> Christian.
>>> Sorry, I didn't get you. What do you mean by "removing the error"?
>> We should just remove the "DRM_ERROR("amdgpu_vkms failure on handling
>> vblank");" message.
>>
>> When drm_crtc_handle_vblank() returns false it doesn't really indicate a
>> failure; it just indicates that vblank is disabled and the timer shouldn't
>> be re-armed.
>>
>> Regards,
>> Christian.
> drm_crtc_handle_vblank, which is a wrapper around drm_handle_vblank, has two
> early checks that test whether vblank support is initialized. Though this will
> never happen in our case, I still check the value of vblank->enabled when
> drm_crtc_handle_vblank returns false: when vblank is indeed disabled, I return
> HRTIMER_NORESTART to not re-arm the timer; otherwise I return HRTIMER_RESTART
> as everything is going as expected.
>
> Anything wrong or am I misunderstanding it?

The extra check for vblank enabled in the driver is just superfluous and 
a little bit erroneous, see drm_handle_vblank():

	/* Need timestamp lock to prevent concurrent execution with
	 * vblank enable/disable, as this would cause inconsistent
	 * or corrupted timestamps and vblank counts.
	 */
	spin_lock(&dev->vblank_time_lock);

	/* Vblank irq handling disabled. Nothing to do. */
	if (!vblank->enabled) {
		spin_unlock(&dev->vblank_time_lock);
		spin_unlock_irqrestore(&dev->event_lock, irqflags);
		return false;
	}


This function already checks for us if vblank is enabled/disabled *and* 
it also holds the correct locks to do so.

So we should *not* check that again in the driver, especially not 
without holding the correct locks.

Regards,
Christian.

>
> Regards,
> Guchun
>>> Regards,
>>> Guchun
>>>>>       return HRTIMER_RESTART;
>>>>>     }
>>>>> @@ -81,7 +88,7 @@ static void amdgpu_vkms_disable_vblank(struct
>>>> drm_crtc *crtc)
>>>>>     {
>>>>>       struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
>>>>>
>>>>> -   hrtimer_cancel(&amdgpu_crtc->vblank_timer);
>>>>> +   hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
>>>>>     }
>>>>>
>>>>>     static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc
>>>>> *crtc,