[PATCH v2] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
Christian König
christian.koenig at amd.com
Mon Jul 10 10:16:13 UTC 2023
Am 10.07.23 um 08:38 schrieb Guchun Chen:
> In below thousands of screen rotation loop tests with virtual display
> enabled, a CPU hard lockup issue may happen, leading system to unresponsive
> and crash.
>
> do {
> xrandr --output Virtual --rotate inverted
> xrandr --output Virtual --rotate right
> xrandr --output Virtual --rotate left
> xrandr --output Virtual --rotate normal
> } while (1);
>
> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
>
> ? hrtimer_run_softirq+0x140/0x140
> ? store_vblank+0xe0/0xe0 [drm]
> hrtimer_cancel+0x15/0x30
> amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
> drm_vblank_disable_and_save+0x185/0x1f0 [drm]
> drm_crtc_vblank_off+0x159/0x4c0 [drm]
> ? record_print_text.cold+0x11/0x11
> ? wait_for_completion_timeout+0x232/0x280
> ? drm_crtc_wait_one_vblank+0x40/0x40 [drm]
> ? bit_wait_io_timeout+0xe0/0xe0
> ? wait_for_completion_interruptible+0x1d7/0x320
> ? mutex_unlock+0x81/0xd0
> amdgpu_vkms_crtc_atomic_disable
>
> It's caused by a stuck in lock dependency in such scenario on different
> CPUs.
>
> CPU1 CPU2
> drm_crtc_vblank_off hrtimer_interrupt
> grab event_lock (irq disabled) __hrtimer_run_queues
> grab vbl_lock/vblank_time_block amdgpu_vkms_vblank_simulate
> amdgpu_vkms_disable_vblank drm_handle_vblank
> hrtimer_cancel grab dev->event_lock
>
> So CPU1 stucks in hrtimer_cancel as timer callback is running endless on
> current clock base, as that timer queue on CPU2 has no chance to finish it
> because of failing to hold the lock. So NMI watchdog will throw the errors
> after its threshold, and all later CPUs are impacted/blocked.
>
> So use hrtimer_try_to_cancel to fix this, as disable_vblank callback
> does not need to wait the handler to finish. And also it's not necessary
> to check the return value of hrtimer_try_to_cancel, because even if it's
> -1 which means current timer callback is running, it will be reprogrammed
> in hrtimer_start with calling enable_vblank to make it works.
>
> v2: only re-arm timer when vblank is enabled (Christian) and add a Fixes
> tag as well
>
> Fixes: 84ec374bd580("drm/amdgpu: create amdgpu_vkms (v4)")
> Cc: stable at vger.kernel.org
> Suggested-by: Christian König <christian.koenig at amd.com>
> Signed-off-by: Guchun Chen <guchun.chen at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> index 53ff91fc6cf6..44d704306f44 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> @@ -46,7 +46,10 @@ static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
> struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct amdgpu_crtc, vblank_timer);
> struct drm_crtc *crtc = &amdgpu_crtc->base;
> struct amdgpu_vkms_output *output = drm_crtc_to_amdgpu_vkms_output(crtc);
> + struct drm_vblank_crtc *vblank;
> + struct drm_device *dev;
> u64 ret_overrun;
> + unsigned int pipe;
> bool ret;
>
> ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
> @@ -54,9 +57,15 @@ static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
> if (ret_overrun != 1)
> DRM_WARN("%s: vblank timer overrun\n", __func__);
>
> + dev = crtc->dev;
> + pipe = drm_crtc_index(crtc);
> + vblank = &dev->vblank[pipe];
> ret = drm_crtc_handle_vblank(crtc);
> - if (!ret)
> - DRM_ERROR("amdgpu_vkms failure on handling vblank");
> + if (!ret && !READ_ONCE(vblank->enabled)) {
> + /* Don't queue timer again when vblank is disabled. */
> + DRM_WARN("amdgpu_vkms failure on handling vblank\n");
You should probably only print the warning when really an error happened.
Disabling the vblank and not firing the timer again is a perfectly
normal operation.
Apart from that looks good to me,
Christian.
> + return HRTIMER_NORESTART;
> + }
>
> return HRTIMER_RESTART;
> }
> @@ -81,7 +90,7 @@ static void amdgpu_vkms_disable_vblank(struct drm_crtc *crtc)
> {
> struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
>
> - hrtimer_cancel(&amdgpu_crtc->vblank_timer);
> + hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
> }
>
> static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,
More information about the amd-gfx
mailing list