[PATCH v3] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

Christian König ckoenig.leichtzumerken at gmail.com
Tue Jul 11 09:22:52 UTC 2023


On 11.07.23 11:15, Chen, Guchun wrote:
> [Public]
>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Tuesday, July 11, 2023 5:09 PM
>> To: Chen, Guchun <Guchun.Chen at amd.com>; amd-gfx at lists.freedesktop.org;
>> Deucher, Alexander <Alexander.Deucher at amd.com>;
>> Zhang, Hawking <Hawking.Zhang at amd.com>;
>> Koenig, Christian <Christian.Koenig at amd.com>;
>> Milinkovic, Dusica <Dusica.Milinkovic at amd.com>;
>> Prica, Nikola <Nikola.Prica at amd.com>; Cui, Flora <Flora.Cui at amd.com>
>> Cc: stable at vger.kernel.org
>> Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
>> hrtimer_try_to_cancel
>>
>>
>>
>> On 11.07.23 03:38, Guchun Chen wrote:
>>> In the screen rotation loop test below, run thousands of times with
>>> virtual display enabled, a CPU hard lockup may happen, leaving the
>>> system unresponsive until it crashes.
>>>
>>> do {
>>>      xrandr --output Virtual --rotate inverted
>>>      xrandr --output Virtual --rotate right
>>>      xrandr --output Virtual --rotate left
>>>      xrandr --output Virtual --rotate normal
>>> } while (1);
>>>
>>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
>>>
>>> ? hrtimer_run_softirq+0x140/0x140
>>> ? store_vblank+0xe0/0xe0 [drm]
>>> hrtimer_cancel+0x15/0x30
>>> amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
>>> drm_vblank_disable_and_save+0x185/0x1f0 [drm]
>>> drm_crtc_vblank_off+0x159/0x4c0 [drm]
>>> ? record_print_text.cold+0x11/0x11
>>> ? wait_for_completion_timeout+0x232/0x280
>>> ? drm_crtc_wait_one_vblank+0x40/0x40 [drm]
>>> ? bit_wait_io_timeout+0xe0/0xe0
>>> ? wait_for_completion_interruptible+0x1d7/0x320
>>> ? mutex_unlock+0x81/0xd0
>>> amdgpu_vkms_crtc_atomic_disable
>>>
>>> It is caused by a lock dependency between the two CPUs in the
>>> following scenario.
>>>
>>> CPU1                                             CPU2
>>> drm_crtc_vblank_off                              hrtimer_interrupt
>>>   grab event_lock (irq disabled)                   __hrtimer_run_queues
>>>     grab vbl_lock/vblank_time_block                  amdgpu_vkms_vblank_simulate
>>>       amdgpu_vkms_disable_vblank                       drm_handle_vblank
>>>         hrtimer_cancel                                     grab dev->event_lock
>>>
>>> So CPU1 gets stuck in hrtimer_cancel, because the timer callback
>>> appears to be running endlessly on the current clock base: CPU2 can
>>> never finish the callback, since it fails to acquire dev->event_lock,
>>> which CPU1 holds. The NMI watchdog then reports the hard lockup once
>>> its threshold is exceeded, and further CPUs get blocked as well.
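
For context, hrtimer_cancel() waits for a running callback by retrying
in a loop, roughly like this simplified sketch (not the literal
kernel/time/hrtimer.c code):

    int hrtimer_cancel(struct hrtimer *timer)
    {
            int ret;

            do {
                    /* returns -1 while the callback is still running */
                    ret = hrtimer_try_to_cancel(timer);
            } while (ret < 0);

            return ret;
    }

With dev->event_lock held by CPU1 and the callback on CPU2 blocked on
that same lock, this loop can never terminate.
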
>>> Fix this by using hrtimer_try_to_cancel, as the disable_vblank
>>> callback does not need to wait for the handler to finish. It is also
>>> unnecessary to check the return value of hrtimer_try_to_cancel:
>>> even if it is -1, meaning the timer callback is currently running,
>>> the timer will simply be reprogrammed by hrtimer_start the next time
>>> enable_vblank is called.
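
The return-value semantics relied on here, as a minimal sketch (the -1
case is the interesting one):

    /*
     * hrtimer_try_to_cancel() returns:
     *    0 - the timer was not active
     *    1 - the timer was active and has been cancelled
     *   -1 - the callback is currently running and cannot be stopped
     */
    ret = hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);

    /*
     * Even for ret == -1 nothing needs to be done here: the next
     * enable_vblank call re-arms the timer via hrtimer_start().
     */
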
>>>
>>> v2: only re-arm timer when vblank is enabled (Christian) and add a
>>> Fixes tag as well
>>>
>>> v3: drop warn printing (Christian)
>>>
>>> Fixes: 84ec374bd580 ("drm/amdgpu: create amdgpu_vkms (v4)")
>>> Cc: stable at vger.kernel.org
>>> Suggested-by: Christian König <christian.koenig at amd.com>
>>> Signed-off-by: Guchun Chen <guchun.chen at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 13 ++++++++++---
>>>    1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> index 53ff91fc6cf6..b870c827cbaa 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> @@ -46,7 +46,10 @@ static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
>>>      struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct amdgpu_crtc, vblank_timer);
>>>      struct drm_crtc *crtc = &amdgpu_crtc->base;
>>>      struct amdgpu_vkms_output *output = drm_crtc_to_amdgpu_vkms_output(crtc);
>>> +   struct drm_vblank_crtc *vblank;
>>> +   struct drm_device *dev;
>>>      u64 ret_overrun;
>>> +   unsigned int pipe;
>>>      bool ret;
>>>
>>>      ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
>>> @@ -54,9 +57,13 @@ static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
>>>      if (ret_overrun != 1)
>>>              DRM_WARN("%s: vblank timer overrun\n", __func__);
>>>
>>> +   dev = crtc->dev;
>>> +   pipe = drm_crtc_index(crtc);
>>> +   vblank = &dev->vblank[pipe];
>>>      ret = drm_crtc_handle_vblank(crtc);
>>> -   if (!ret)
>>> -           DRM_ERROR("amdgpu_vkms failure on handling vblank");
>>> +   /* Don't queue timer again when vblank is disabled. */
>>> +   if (!ret && !READ_ONCE(vblank->enabled))
>>> +           return HRTIMER_NORESTART;
>> Since drm_crtc_handle_vblank() returns false when vblank is disabled,
>> I think we can simplify this to just removing the error.
>>
>> Regards,
>> Christian.
> Sorry, I didn't get you. What do you mean by "removing the error"?

We should just remove the "DRM_ERROR("amdgpu_vkms failure on handling
vblank");" message.

When drm_crtc_handle_vblank() returns false, it doesn't really indicate
a failure; it just indicates that vblank is disabled and the timer
shouldn't be re-armed.
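
Illustrative only, a sketch of what the simplified callback could look
like with the error message dropped (not the literal v3 diff):

    static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
    {
            struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct amdgpu_crtc, vblank_timer);
            struct drm_crtc *crtc = &amdgpu_crtc->base;
            struct amdgpu_vkms_output *output = drm_crtc_to_amdgpu_vkms_output(crtc);

            if (hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
                                    output->period_ns) != 1)
                    DRM_WARN("%s: vblank timer overrun\n", __func__);

            /*
             * A false return here just means vblank is disabled, so
             * stop the timer instead of treating it as an error.
             */
            if (!drm_crtc_handle_vblank(crtc))
                    return HRTIMER_NORESTART;

            return HRTIMER_RESTART;
    }
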

Regards,
Christian.

>
> Regards,
> Guchun
>>>      return HRTIMER_RESTART;
>>>    }
>>> @@ -81,7 +88,7 @@ static void amdgpu_vkms_disable_vblank(struct drm_crtc *crtc)
>>>    {
>>>      struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
>>>
>>> -   hrtimer_cancel(&amdgpu_crtc->vblank_timer);
>>> +   hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
>>>    }
>>>
>>>    static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,


