<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    Am 11.07.23 um 11:31 schrieb Chen, Guchun:<br>
    <blockquote type="cite" cite="mid:DM5PR12MB2469C1707C855DEFAC306744F131A@DM5PR12MB2469.namprd12.prod.outlook.com">
      <pre class="moz-quote-pre" wrap="">[Public]

</pre>
      <blockquote type="cite">
        <pre class="moz-quote-pre" wrap="">-----Original Message-----
From: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:ckoenig.leichtzumerken@gmail.com"><ckoenig.leichtzumerken@gmail.com></a>
Sent: Tuesday, July 11, 2023 5:23 PM
To: Chen, Guchun <a class="moz-txt-link-rfc2396E" href="mailto:Guchun.Chen@amd.com"><Guchun.Chen@amd.com></a>; amd-
<a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org">gfx@lists.freedesktop.org</a>; Deucher, Alexander
<a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com"><Alexander.Deucher@amd.com></a>; Zhang, Hawking
<a class="moz-txt-link-rfc2396E" href="mailto:Hawking.Zhang@amd.com"><Hawking.Zhang@amd.com></a>; Koenig, Christian
<a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a>; Milinkovic, Dusica
<a class="moz-txt-link-rfc2396E" href="mailto:Dusica.Milinkovic@amd.com"><Dusica.Milinkovic@amd.com></a>; Prica, Nikola <a class="moz-txt-link-rfc2396E" href="mailto:Nikola.Prica@amd.com"><Nikola.Prica@amd.com></a>; Cui,
Flora <a class="moz-txt-link-rfc2396E" href="mailto:Flora.Cui@amd.com"><Flora.Cui@amd.com></a>
Cc: <a class="moz-txt-link-abbreviated" href="mailto:stable@vger.kernel.org">stable@vger.kernel.org</a>
Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
hrtimer_try_to_cancel

Am 11.07.23 um 11:15 schrieb Chen, Guchun:
</pre>
        <blockquote type="cite">
          <pre class="moz-quote-pre" wrap="">[Public]

</pre>
          <blockquote type="cite">
            <pre class="moz-quote-pre" wrap="">-----Original Message-----
From: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:ckoenig.leichtzumerken@gmail.com"><ckoenig.leichtzumerken@gmail.com></a>
Sent: Tuesday, July 11, 2023 5:09 PM
To: Chen, Guchun <a class="moz-txt-link-rfc2396E" href="mailto:Guchun.Chen@amd.com"><Guchun.Chen@amd.com></a>; amd-
<a class="moz-txt-link-abbreviated" href="mailto:gfx@lists.freedesktop.org">gfx@lists.freedesktop.org</a>; Deucher, Alexander
<a class="moz-txt-link-rfc2396E" href="mailto:Alexander.Deucher@amd.com"><Alexander.Deucher@amd.com></a>; Zhang, Hawking
</pre>
          </blockquote>
        </blockquote>
        <pre class="moz-quote-pre" wrap=""><a class="moz-txt-link-rfc2396E" href="mailto:Hawking.Zhang@amd.com"><Hawking.Zhang@amd.com></a>;
</pre>
        <blockquote type="cite">
          <blockquote type="cite">
            <pre class="moz-quote-pre" wrap="">Koenig, Christian <a class="moz-txt-link-rfc2396E" href="mailto:Christian.Koenig@amd.com"><Christian.Koenig@amd.com></a>; Milinkovic, Dusica
<a class="moz-txt-link-rfc2396E" href="mailto:Dusica.Milinkovic@amd.com"><Dusica.Milinkovic@amd.com></a>; Prica, Nikola <a class="moz-txt-link-rfc2396E" href="mailto:Nikola.Prica@amd.com"><Nikola.Prica@amd.com></a>;
Cui, Flora <a class="moz-txt-link-rfc2396E" href="mailto:Flora.Cui@amd.com"><Flora.Cui@amd.com></a>
Cc: <a class="moz-txt-link-abbreviated" href="mailto:stable@vger.kernel.org">stable@vger.kernel.org</a>
Subject: Re: [PATCH v3] drm/amdgpu/vkms: relax timer deactivation by
hrtimer_try_to_cancel



Am 11.07.23 um 03:38 schrieb Guchun Chen:
</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">In below thousands of screen rotation loop tests with virtual
display enabled, a CPU hard lockup issue may happen, leading system
to unresponsive and crash.

do {
     xrandr --output Virtual --rotate inverted
     xrandr --output Virtual --rotate right
     xrandr --output Virtual --rotate left
     xrandr --output Virtual --rotate normal } while (1);

NMI watchdog: Watchdog detected hard LOCKUP on cpu 1

? hrtimer_run_softirq+0x140/0x140
? store_vblank+0xe0/0xe0 [drm]
hrtimer_cancel+0x15/0x30
amdgpu_vkms_disable_vblank+0x15/0x30 [amdgpu]
drm_vblank_disable_and_save+0x185/0x1f0 [drm]
drm_crtc_vblank_off+0x159/0x4c0 [drm] ?
record_print_text.cold+0x11/0x11 ?
wait_for_completion_timeout+0x232/0x280
? drm_crtc_wait_one_vblank+0x40/0x40 [drm] ?
bit_wait_io_timeout+0xe0/0xe0 ?
wait_for_completion_interruptible+0x1d7/0x320
? mutex_unlock+0x81/0xd0
amdgpu_vkms_crtc_atomic_disable

It's caused by a stuck in lock dependency in such scenario on
different CPUs.

CPU1                                             CPU2
drm_crtc_vblank_off                              hrtimer_interrupt
      grab event_lock (irq disabled)                   __hrtimer_run_queues
          grab vbl_lock/vblank_time_block
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">amdgpu_vkms_vblank_simulate
</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">              amdgpu_vkms_disable_vblank                       drm_handle_vblank
                  hrtimer_cancel                                         grab dev->event_lock

So CPU1 stucks in hrtimer_cancel as timer callback is running
endless on current clock base, as that timer queue on CPU2 has no
chance to finish it because of failing to hold the lock. So NMI
watchdog will throw the errors after its threshold, and all later
CPUs are
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">impacted/blocked.
</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">So use hrtimer_try_to_cancel to fix this, as disable_vblank callback
does not need to wait the handler to finish. And also it's not
necessary to check the return value of hrtimer_try_to_cancel,
because even if it's
-1 which means current timer callback is running, it will be
reprogrammed in hrtimer_start with calling enable_vblank to make it
</pre>
            </blockquote>
          </blockquote>
        </blockquote>
        <pre class="moz-quote-pre" wrap="">works.
</pre>
        <blockquote type="cite">
          <blockquote type="cite">
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">
v2: only re-arm timer when vblank is enabled (Christian) and add a
Fixes tag as well

v3: drop warn printing (Christian)

Fixes: 84ec374bd580("drm/amdgpu: create amdgpu_vkms (v4)")
Cc: <a class="moz-txt-link-abbreviated" href="mailto:stable@vger.kernel.org">stable@vger.kernel.org</a>
Suggested-by: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com"><christian.koenig@amd.com></a>
Signed-off-by: Guchun Chen <a class="moz-txt-link-rfc2396E" href="mailto:guchun.chen@amd.com"><guchun.chen@amd.com></a>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c | 13 ++++++++++---
   1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
index 53ff91fc6cf6..b870c827cbaa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vkms.c
@@ -46,7 +46,10 @@ static enum hrtimer_restart
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">     struct amdgpu_crtc *amdgpu_crtc = container_of(timer, struct
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">amdgpu_crtc, vblank_timer);
</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">     struct drm_crtc *crtc = &amdgpu_crtc->base;
     struct amdgpu_vkms_output *output =
drm_crtc_to_amdgpu_vkms_output(crtc);
+   struct drm_vblank_crtc *vblank;
+   struct drm_device *dev;
     u64 ret_overrun;
+   unsigned int pipe;
     bool ret;

     ret_overrun = hrtimer_forward_now(&amdgpu_crtc->vblank_timer,
@@ -54,9 +57,13 @@ static enum hrtimer_restart
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">amdgpu_vkms_vblank_simulate(struct hrtimer *timer)
</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">     if (ret_overrun != 1)
             DRM_WARN("%s: vblank timer overrun\n", __func__);

+   dev = crtc->dev;
+   pipe = drm_crtc_index(crtc);
+   vblank = &dev->vblank[pipe];
     ret = drm_crtc_handle_vblank(crtc);
-   if (!ret)
-           DRM_ERROR("amdgpu_vkms failure on handling vblank");
+   /* Don't queue timer again when vblank is disabled. */
+   if (!ret && !READ_ONCE(vblank->enabled))
+           return HRTIMER_NORESTART;
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">When drm_crtc_handle_vblank() returns false when vblank is disabled I
think we can simplify this to just removing the error.

Regards,
Christian.
</pre>
          </blockquote>
          <pre class="moz-quote-pre" wrap="">Sorry, I didn't get you. What do you mean by "removing the error"?
</pre>
        </blockquote>
        <pre class="moz-quote-pre" wrap="">
We should just remove the "DRM_ERROR("amdgpu_vkms failure on handling
vblank");" message.

When the drm_crtc_handle_vblank() returns false it doesn't really indicate a
failure, it just indicates that the vblank is disabled and shouldn't be re-armed.

Regards,
Christian.
</pre>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">
drm_crtc_handle_vblank which is a wrapper of drm_handle_vblank, has two early test calls to check if vblank is initialized. Though this will never happen in our case, I still check the value of vblank->enabled when drm_crtc_handle_vblank returns false, and when it's indeed disabled, return HRTIMER_NORESTART to not re-arm timer, otherwise, returning HRTIMER_RESTART when it's going as expected.

Anything wrong or am I misunderstanding it?</pre>
    </blockquote>
    <br>
    The extra check for vblank enabled in the driver is just superfluous
    and a little bit erroneous, see drm_handle_vblank():<br>
    <br>
    <pre><span class="w"> </span><span class="cm">/* Need timestamp lock to prevent concurrent execution with</span>
<span class="cm">        * vblank enable/disable, as this would cause inconsistent</span>
<span class="cm">        * or corrupted timestamps and vblank counts.</span>
<span class="cm">        */</span><span class="w"></span>
<span class="w">        </span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/spin_lock">spin_lock</a></span><span class="p">(</span><span class="o">&</span><span class="n">dev</span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/vblank_time_lock">vblank_time_lock</a></span><span class="p">);</span><span class="w"></span>

<span class="w">        </span><span class="cm">/* Vblank irq handling disabled. Nothing to do. */</span><span class="w"></span>
<span class="w">        </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/vblank">vblank</a></span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/enabled">enabled</a></span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w"></span>
<span class="w">                </span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/spin_unlock">spin_unlock</a></span><span class="p">(</span><span class="o">&</span><span class="n">dev</span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/vblank_time_lock">vblank_time_lock</a></span><span class="p">);</span><span class="w"></span>
<span class="w">                </span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/spin_unlock_irqrestore">spin_unlock_irqrestore</a></span><span class="p">(</span><span class="o">&</span><span class="n">dev</span><span class="o">-></span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/event_lock">event_lock</a></span><span class="p">,</span><span class="w"> </span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/irqflags">irqflags</a></span><span class="p">);</span><span class="w"></span>
<span class="w">                </span><span class="k">return</span><span class="w"> </span><span class="n"><a href="https://elixir.bootlin.com/linux/latest/C/ident/false">false</a></span><span class="p">;</span><span class="w"></span>
<span class="w">        </span><span class="p">}</span><span class="w"></span></pre>
    <span class="p"></span><span class="w"></span><br>
    This function already checks for us if vblank is enabled/disabled
    *and* it also holds the correct locks to do so.<br>
    <br>
    So we should *not* check that again in the driver, especially not
    without holding the correct locks.<br>
    <br>
    Regards,<br>
    Christian.<br>
    <br>
    <blockquote type="cite" cite="mid:DM5PR12MB2469C1707C855DEFAC306744F131A@DM5PR12MB2469.namprd12.prod.outlook.com">
      <pre class="moz-quote-pre" wrap="">

Regards,
Guchun
</pre>
      <blockquote type="cite">
        <blockquote type="cite">
          <pre class="moz-quote-pre" wrap="">
Regards,
Guchun
</pre>
          <blockquote type="cite">
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">     return HRTIMER_RESTART;
   }
@@ -81,7 +88,7 @@ static void amdgpu_vkms_disable_vblank(struct
</pre>
            </blockquote>
            <pre class="moz-quote-pre" wrap="">drm_crtc *crtc)
</pre>
            <blockquote type="cite">
              <pre class="moz-quote-pre" wrap="">   {
     struct amdgpu_crtc *amdgpu_crtc = to_amdgpu_crtc(crtc);

-   hrtimer_cancel(&amdgpu_crtc->vblank_timer);
+   hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
   }

   static bool amdgpu_vkms_get_vblank_timestamp(struct drm_crtc
*crtc,
</pre>
            </blockquote>
          </blockquote>
        </blockquote>
      </blockquote>
      <pre class="moz-quote-pre" wrap="">
</pre>
    </blockquote>
    <br>
  </body>
</html>