[PATCH v3] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
Christian König
ckoenig.leichtzumerken at gmail.com
Tue Jul 11 09:22:52 UTC 2023
On 11.07.23 at 11:15, Chen, Guchun wrote:
> [Public]
>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Tuesday, July 11, 2023 5:09 PM
>> To: Chen, Guchun <Guchun.Chen at amd.com>; amd-
>> gfx at lists.freedesktop.org; Deucher, Alexander
>> <Alexander.Deucher at amd.com>; Zhang, Hawking
>> <Hawking.Zhang at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Milinkovic, Dusica
>> <Dusica.Milinkovic at amd.com>; Prica, Nikola <Nikola.Prica at amd.com>; Cui,
>> Flora <Flora.Cui at amd.com>
>> Cc: stable at vger.kernel.org
>> Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
>> hrtimer_try_to_cancel
>>
>>
>>
>> On 11.07.23 at 03:38, Guchun Chen wrote:
>>> In the screen rotation loop test below, executed thousands of times
>>> with virtual display enabled, a CPU hard lockup may happen, leaving
>>> the system unresponsive until it crashes.
>>>
>>> do {
>>> xrandr --output Virtual --rotate inverted
>>> xrandr --output Virtual --rotate right
>>> xrandr --output Virtual --rotate left
>>> xrandr --output Virtual --rotate normal
>>> } while (1);
>>>
>>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
>>>
>>> ? hrtimer_run_softirq+0x140/0x140
>>> ? store_vblank+0xe0/0xe0 [drm]
>>> hrtimer_cancel+0x15/0x30
>>> amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
>>> drm_vblank_disable_and_save+0x185/0x1f0 [drm]
>>> drm_crtc_vblank_off+0x159/0x4c0 [drm]
>>> ? record_print_text.cold+0x11/0x11
>>> ? wait_for_completion_timeout+0x232/0x280
>>> ? drm_crtc_wait_one_vblank+0x40/0x40 [drm]
>>> ? bit_wait_io_timeout+0xe0/0xe0
>>> ? wait_for_completion_interruptible+0x1d7/0x320
>>> ? mutex_unlock+0x81/0xd0
>>> amdgpu_vkms_crtc_atomic_disable
>>>
>>> It's caused by a deadlock on lock dependencies between two CPUs in
>>> the following scenario.
>>>
>>> CPU1                                        CPU2
>>> drm_crtc_vblank_off                         hrtimer_interrupt
>>>   grab event_lock (irq disabled)              __hrtimer_run_queues
>>>     grab vbl_lock/vblank_time_lock              amdgpu_vkms_vblank_simulate
>>>       amdgpu_vkms_disable_vblank                  drm_handle_vblank
>>>         hrtimer_cancel                              grab dev->event_lock
>>>
>>> So CPU1 is stuck in hrtimer_cancel, because the timer callback looks
>>> like it is running endlessly on the current clock base: the timer
>>> queue on CPU2 never gets the chance to finish the callback, as it
>>> fails to acquire dev->event_lock. The NMI watchdog then reports hard
>>> lockup errors once its threshold is exceeded, and all later CPUs are
>>> impacted/blocked.
>>> Fix this by using hrtimer_try_to_cancel, as the disable_vblank
>>> callback does not need to wait for the handler to finish. It is also
>>> not necessary to check the return value of hrtimer_try_to_cancel:
>>> even if it returns -1, meaning the timer callback is currently
>>> running, the timer will be reprogrammed by hrtimer_start when
>>> enable_vblank is called, so everything keeps working.
>>>
>>> v2: only re-arm timer when vblank is enabled (Christian) and add a
>>> Fixes tag as well
>>>
>>> v3: drop warn printing (Christian)
>>>
>>> Fixes: 84ec374bd580 ("drm/amdgpu: create amdgpu_vkms (v4)")
>>> Cc: stable at vger.kernel.org
>>> Suggested-by: Christian König <christian.koenig at amd.com>
>>> Signed-off-by: Guchun Chen <guchun.chen at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 13 ++++++++++---
>>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> index 53ff91fc6cf6..b870c827cbaa 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
>>> @@ -46,7 +46,10 @@ static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
>>>  	struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct amdgpu_crtc, vblank_timer);
>>>  	struct drm_crtc *crtc = &amdgpu_crtc->base;
>>>  	struct amdgpu_vkms_output *output = drm_crtc_to_amdgpu_vkms_output(crtc);
>>> +	struct drm_vblank_crtc *vblank;
>>> +	struct drm_device *dev;
>>>  	u64 ret_overrun;
>>> +	unsigned int pipe;
>>>  	bool ret;
>>>
>>>  	ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
>>> @@ -54,9 +57,13 @@ static enum hrtimer_restart amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
>>>  	if (ret_overrun != 1)
>>>  		DRM_WARN("%s: vblank timer overrun\n", __func__);
>>>
>>> +	dev = crtc->dev;
>>> +	pipe = drm_crtc_index(crtc);
>>> +	vblank = &dev->vblank[pipe];
>>>  	ret = drm_crtc_handle_vblank(crtc);
>>> -	if (!ret)
>>> -		DRM_ERROR("amdgpu_vkms failure on handling vblank");
>>> +	/* Don't queue timer again when vblank is disabled. */
>>> +	if (!ret && !READ_ONCE(vblank->enabled))
>>> +		return HRTIMER_NORESTART;
>> Since drm_crtc_handle_vblank() returns false when vblank is disabled, I
>> think we can simplify this further by just removing the error message.
>>
>> Regards,
>> Christian.
> Sorry, I didn't get you. What do you mean by "removing the error"?
We should just remove the "DRM_ERROR("amdgpu_vkms failure on handling
vblank");" message.
When drm_crtc_handle_vblank() returns false, it doesn't really indicate
a failure; it just indicates that vblank is disabled and the timer
shouldn't be re-armed.
Regards,
Christian.
>
> Regards,
> Guchun
>>> return HRTIMER_RESTART;
>>> }
>>> @@ -81,7 +88,7 @@ static void amdgpu_vkms_disable_vblank(struct drm_crtc *crtc)
>>> {
>>>  	struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
>>>
>>> -	hrtimer_cancel(&amdgpu_crtc->vblank_timer);
>>> +	hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
>>>  }
>>>
>>> static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,