[PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel

Chen, Guchun Guchun.Chen at amd.com
Mon Jul 10 06:40:35 UTC 2023


[Public]

> -----Original Message-----
> From: Chen, Guchun
> Sent: Friday, July 7, 2023 9:06 AM
> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Milinkovic, Dusica <Dusica.Milinkovic at amd.com>; Prica, Nikola <Nikola.Prica at amd.com>; Cui, Flora <Flora.Cui at amd.com>
> Cc: stable at vger.kernel.org
> Subject: RE: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
>
>
>
> > -----Original Message-----
> > From: Koenig, Christian <Christian.Koenig at amd.com>
> > Sent: Thursday, July 6, 2023 7:25 PM
> > To: Chen, Guchun <Guchun.Chen at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Milinkovic, Dusica <Dusica.Milinkovic at amd.com>; Prica, Nikola <Nikola.Prica at amd.com>; Cui, Flora <Flora.Cui at amd.com>
> > Cc: stable at vger.kernel.org
> > Subject: Re: [PATCH] drm/amdgpu/vkms: relax timer deactivation by hrtimer_try_to_cancel
> >
> > On 06.07.23 at 10:35, Guchun Chen wrote:
> > > Running thousands of iterations of the screen rotation loop below
> > > with virtual display enabled can trigger a CPU hard lockup, leaving
> > > the system unresponsive until it crashes.
> > >
> > > do {
> > >   xrandr --output Virtual --rotate inverted
> > >   xrandr --output Virtual --rotate right
> > >   xrandr --output Virtual --rotate left
> > >   xrandr --output Virtual --rotate normal
> > > } while (1);
> > >
> > > NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
> > >
> > > ? hrtimer_run_softirq+0x140/0x140
> > > ? store_vblank+0xe0/0xe0 [drm]
> > > hrtimer_cancel+0x15/0x30
> > > amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
> > > drm_vblank_disable_and_save+0x185/0x1f0 [drm]
> > > drm_crtc_vblank_off+0x159/0x4c0 [drm]
> > > ? record_print_text.cold+0x11/0x11
> > > ? wait_for_completion_timeout+0x232/0x280
> > > ? drm_crtc_wait_one_vblank+0x40/0x40 [drm]
> > > ? bit_wait_io_timeout+0xe0/0xe0
> > > ? wait_for_completion_interruptible+0x1d7/0x320
> > > ? mutex_unlock+0x81/0xd0
> > > amdgpu_vkms_crtc_atomic_disable
> > >
> > > It is caused by a lock dependency deadlock between two CPUs in this
> > > scenario:
> > >
> > > CPU1                                             CPU2
> > > drm_crtc_vblank_off                              hrtimer_interrupt
> > >     grab event_lock (irq disabled)                   __hrtimer_run_queues
> > >         grab vbl_lock/vblank_time_lock                   amdgpu_vkms_vblank_simulate
> > >             amdgpu_vkms_disable_vblank                       drm_handle_vblank
> > >                 hrtimer_cancel                                   grab dev->event_lock
> > >
> > > So CPU1 gets stuck in hrtimer_cancel, because the timer callback
> > > keeps running endlessly on the current clock base: the timer queue
> > > on CPU2 never gets a chance to finish the callback, as it fails to
> > > acquire dev->event_lock. Once its threshold is exceeded, the NMI
> > > watchdog reports the hard lockup, and every CPU that is touched
> > > afterwards is blocked as well.
> > >
> > > Fix this by using hrtimer_try_to_cancel, as the disable_vblank
> > > callback does not need to wait for the handler to finish. It is also
> > > unnecessary to check the return value of hrtimer_try_to_cancel:
> > > even if it returns -1, meaning the timer callback is currently
> > > running, the timer will be reprogrammed by the hrtimer_start call in
> > > enable_vblank.
> > >
> > > Cc: stable at vger.kernel.org
> > > Suggested-by: Christian König <christian.koenig at amd.com>
> > > Signed-off-by: Guchun Chen <guchun.chen at amd.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > > index 53ff91fc6cf6..70fb0df039e3 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
> > > @@ -81,7 +81,7 @@ static void amdgpu_vkms_disable_vblank(struct drm_crtc *crtc)
> > >   {
> > >           struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);
> > >
> > > - hrtimer_cancel(&amdgpu_crtc->vblank_timer);
> > > + hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
> >
> > That's a first step, but not sufficient.
> >
> > You also need to change the "return HRTIMER_RESTART;" in
> > amdgpu_vkms_vblank_simulate() to only re-arm the interrupt when it is
> > enabled.
> >
> > Finally, I strongly suggest implementing an amdgpu_vkms_destroy()
> > function to make sure the hrtimer is properly cleaned up.
>
> Good suggestion, will fix it in V2.
>
Hi Christian,

I just sent out patch v2 to address the return value problem in amdgpu_vkms_vblank_simulate.

Regarding hrtimer cleanup, it is already handled in sw_fini in the amdgpu_vkms code, so I think it is fine for now. Anyway, we can continue the discussion in the new patch set thread.

Regards,
Guchun
> Regards,
> Guchun
> > Regards,
> > Christian.
> >
> > >   }
> > >
> > >   static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc *crtc,


