<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    On 19.08.22 at 11:34, 李真能 wrote:<br>
    <blockquote type="cite" cite="mid:103b4a67-385c-68f5-f40f-39bbc1d9f723@kylinos.cn">
      
      <p>On 2022/8/17 at 19:40, Christian König wrote:<br>
      </p>
      <blockquote type="cite" cite="mid:cc30a694-c784-f42c-bab6-b45c70202c56@amd.com">On
        17.08.22 at 09:31, 李真能 wrote: <br>
        <blockquote type="cite"> <br>
          On 2022/8/15 at 21:12, Christian König wrote: <br>
          <blockquote type="cite">On 15.08.22 at 09:34, 李真能 wrote: <br>
            <blockquote type="cite"> <br>
              On 2022/8/12 at 18:55, Christian König wrote: <br>
              <blockquote type="cite">On 11.08.22 at 09:25, Zhenneng Li
                wrote: <br>
                <blockquote type="cite">Although the radeon driver fences and
                  waits for the GPU to finish processing the current batch of
                  rings, <br>
                  there is still a corner case in which the radeon lockup work
                  queue may not be fully flushed <br>
                  while radeon_suspend_kms() has already called
                  pci_set_power_state() to <br>
                  put the device into the D3hot state. <br>
                </blockquote>
                <br>
                If I'm not completely mistaken, the reset worker uses the
                suspend/resume functionality as well to get the hardware
                into a working state again. <br>
                <br>
                So if I'm not completely mistaken, this would lead to a
                deadlock here; please double check that. <br>
              </blockquote>
              <br>
              We have tested this many times; there is no deadlock. <br>
            </blockquote>
            <br>
            Testing doesn't tell you anything; you need to audit the
            call paths. <br>
            <br>
            <blockquote type="cite">In which situation would this lead to a
              deadlock? <br>
            </blockquote>
            <br>
            GPU resets. <br>
          </blockquote>
          <br>
          Although
          flush_delayed_work(&rdev->fence_drv[i].lockup_work)
          waits for the last queued instance of lockup_work to finish
          executing, this kernel function does not take any lock, and
          lockup_work runs in another kernel thread, so I think
          flush_delayed_work cannot lead to a deadlock. <br>
          <br>
          Therefore, even if radeon_gpu_reset is called in another thread
          while radeon_suspend_kms is blocked in flush_delayed_work, this
          cannot lead to a deadlock. <br>
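          <br>
          A minimal sketch (hypothetical demo_* names, not the actual radeon
          code) of the only interaction flush_delayed_work() has with the
          worker: it waits for the queued handler to finish executing and
          takes no lock of its own. <br>
          <pre>
/* Illustrative only: the demo_* identifiers are made up; the calls are the
 * standard kernel workqueue API used by the fence lockup check. */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct delayed_work demo_lockup_work;

/* runs in a kworker thread, like radeon_fence_check_lockup() */
static void demo_lockup_handler(struct work_struct *work)
{
	pr_info("checking for a GPU lockup\n");
}

static int __init demo_init(void)
{
	INIT_DELAYED_WORK(&demo_lockup_work, demo_lockup_handler);
	schedule_delayed_work(&demo_lockup_work, msecs_to_jiffies(500));

	/* blocks until the last queued instance has finished executing;
	 * the flush itself acquires no lock */
	flush_delayed_work(&demo_lockup_work);
	return 0;
}

static void __exit demo_exit(void)
{
	cancel_delayed_work_sync(&demo_lockup_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
          </pre>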
        </blockquote>
        <br>
        Ok, it sounds like you didn't get what I wanted to say. <br>
        <br>
        The key problem is that radeon_gpu_reset() calls
        radeon_suspend(), which in turn calls
        rdev->asic->suspend(). <br>
        <br>
        And this function in turn could end up in radeon_suspend_kms()
        again, but I'm not 100% sure about that. <br>
        <br>
        Just double check the order of the functions called here (e.g.
        whether radeon_suspend_kms() calls radeon_suspend() or the other
        way around).  Apart from that your patch looks correct to me as
        well. <br>
        <br>
      </blockquote>
      <p>radeon_gpu_reset will call radeon_suspend, but radeon_suspend
        will not call radeon_suspend_kms. For more detail on the call
        flow, see the attached file radeon-suspend-reset.png.</p>
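      <p>To summarize that call flow (a simplified sketch; only the
        relationships stated in this thread and in the attachment are
        shown):</p>
      <pre>
  system suspend path:                      GPU reset path:

  radeon_suspend_kms()                      radeon_gpu_reset()
    -> flush_delayed_work(                    -> radeon_suspend()
         &rdev->fence_drv[i].lockup_work)         -> rdev->asic->suspend()
    -> ...
    -> pci_set_power_state(D3hot)

  radeon_suspend() does not call back into radeon_suspend_kms(), so a
  concurrent reset cannot re-enter the suspend path while it is blocked
  in flush_delayed_work().
      </pre>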
      <p>Sorry, I may have mistaken your meaning.<br>
      </p>
    </blockquote>
    <br>
    Ok, in this case my memory of the call flow wasn't correct any more.<br>
    <br>
    Feel free to add an Acked-by: Christian König
    <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com"><christian.koenig@amd.com></a> to the patch.<br>
    <br>
    Alex should then pick it up for upstreaming.<br>
    <br>
    Thanks,<br>
    Christian.<br>
    <br>
    <blockquote type="cite" cite="mid:103b4a67-385c-68f5-f40f-39bbc1d9f723@kylinos.cn">
      <blockquote type="cite" cite="mid:cc30a694-c784-f42c-bab6-b45c70202c56@amd.com">Regards,
        <br>
        Christian. <br>
        <br>
        <blockquote type="cite"> <br>
          <blockquote type="cite"> <br>
            Regards, <br>
            Christian. <br>
            <br>
            <blockquote type="cite"> <br>
              <blockquote type="cite"> <br>
                Regards, <br>
                Christian. <br>
                <br>
                <blockquote type="cite">Per PCI spec rev 4.0, section
                  5.3.1.4.1 (D3hot State): <br>
                  <blockquote type="cite">Configuration and Message
                    requests are the only TLPs accepted by a Function in
                    <br>
                    the D3hot state. All other received Requests must be
                    handled as Unsupported Requests, <br>
                    and all received Completions may optionally be
                    handled as Unexpected Completions. <br>
                  </blockquote>
                  This issue shows up in the following log: <br>
                  Unable to handle kernel paging request at virtual
                  address 00008800e0008010 <br>
                  CPU 0 kworker/0:3(131): Oops 0 <br>
                  pc = [<ffffffff811bea5c>]  ra =
                  [<ffffffff81240844>]  ps = 0000 Tainted:
                  G        W <br>
                  pc is at si_gpu_check_soft_reset+0x3c/0x240 <br>
                  ra is at si_dma_is_lockup+0x34/0xd0 <br>
                  v0 = 0000000000000000  t0 = fff08800e0008010  t1 =
                  0000000000010000 <br>
                  t2 = 0000000000008010  t3 = fff00007e3c00000  t4 =
                  fff00007e3c00258 <br>
                  t5 = 000000000000ffff  t6 = 0000000000000001  t7 =
                  fff00007ef078000 <br>
                  s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 =
                  fff00007e3c00018 <br>
                  s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 =
                  0000000000000000 <br>
                  s6 = fff00007ef07bd98 <br>
                  a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 =
                  0000000000000008 <br>
                  a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 =
                  ffffffff810f4338 <br>
                  t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 =
                  ff6769c5d964b800 <br>
                  t11= 000000000000b886  pv = ffffffff811bea20  at =
                  0000000000000000 <br>
                  gp = ffffffff81d89690  sp = 00000000aa814126 <br>
                  Disabling lock debugging due to kernel taint <br>
                  Trace: <br>
                  [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
                  <br>
                  [<ffffffff81119610>]
                  radeon_fence_check_lockup+0xd0/0x290 <br>
                  [<ffffffff80977010>]
                  process_one_work+0x280/0x550 <br>
                  [<ffffffff80977350>] worker_thread+0x70/0x7c0 <br>
                  [<ffffffff80977410>] worker_thread+0x130/0x7c0 <br>
                  [<ffffffff80982040>] kthread+0x200/0x210 <br>
                  [<ffffffff809772e0>] worker_thread+0x0/0x7c0 <br>
                  [<ffffffff80981f8c>] kthread+0x14c/0x210 <br>
                  [<ffffffff80911658>]
                  ret_from_kernel_thread+0x18/0x20 <br>
                  [<ffffffff80981e40>] kthread+0x0/0x210 <br>
                    Code: ad3e0008  43f0074a  ad7e0018  ad9e0020 
                  8c3001e8 40230101 <br>
                    <88210000> 4821ed21 <br>
                  So force a flush of the lockup work queue to fix this problem.
                  <br>
                  <br>
                  Signed-off-by: Zhenneng Li <a class="moz-txt-link-rfc2396E" href="mailto:lizhenneng@kylinos.cn" moz-do-not-send="true"><lizhenneng@kylinos.cn></a>
                  <br>
                  --- <br>
                    drivers/gpu/drm/radeon/radeon_device.c | 3 +++ <br>
                    1 file changed, 3 insertions(+) <br>
                  <br>
                  diff --git a/drivers/gpu/drm/radeon/radeon_device.c
                  b/drivers/gpu/drm/radeon/radeon_device.c <br>
                  index 15692cb241fc..e608ca26780a 100644 <br>
                  --- a/drivers/gpu/drm/radeon/radeon_device.c <br>
                  +++ b/drivers/gpu/drm/radeon/radeon_device.c <br>
                  @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct
                  drm_device *dev, bool suspend, <br>
                            if (r) { <br>
                                /* delay GPU reset to resume */ <br>
                               
                  radeon_fence_driver_force_completion(rdev, i); <br>
                  +        } else { <br>
                  +            /* finish executing delayed work */ <br>
                  +
                  flush_delayed_work(&rdev->fence_drv[i].lockup_work);
                  <br>
                            } <br>
                        } <br>
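                  <br>
                  A hedged sketch of the race this hunk closes (function
                  names are taken from the trace above; the exact
                  interleaving is illustrative): <br>
                  <pre>
  CPU A: radeon_suspend_kms()             CPU B: kworker
  ---------------------------             --------------
  fence wait succeeds (r == 0)
  ...                                     radeon_fence_check_lockup()
  pci_set_power_state(D3hot)                si_dma_is_lockup()
                                              si_gpu_check_soft_reset()
                                                MMIO read on a D3hot device
                                                -> unhandled paging request

  With the added flush_delayed_work(), the suspend path waits for any
  pending lockup check to finish before the device is put into D3hot.
                  </pre>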
                </blockquote>
                <br>
              </blockquote>
            </blockquote>
            <br>
          </blockquote>
        </blockquote>
        <br>
      </blockquote>
    </blockquote>
    <br>
  </body>
</html>