<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">在 2022/8/17 19:40, Christian König 写道:<br>
    </div>
    <blockquote type="cite"
      cite="mid:cc30a694-c784-f42c-bab6-b45c70202c56@amd.com">Am
      17.08.22 um 09:31 schrieb 李真能:
      <br>
      <blockquote type="cite">
        <br>
        在 2022/8/15 21:12, Christian König 写道:
        <br>
        <blockquote type="cite">Am 15.08.22 um 09:34 schrieb 李真能:
          <br>
          <blockquote type="cite">
            <br>
            在 2022/8/12 18:55, Christian König 写道:
            <br>
            <blockquote type="cite">Am 11.08.22 um 09:25 schrieb
              Zhenneng Li:
              <br>
              <blockquote type="cite">Although radeon card fence and
                wait for gpu to finish processing current batch rings,
                <br>
                there is still a corner case that radeon lockup work
                queue may not be fully flushed,
                <br>
                and meanwhile the radeon_suspend_kms() function has
                called pci_set_power_state() to
                <br>
                put device in D3hot state.
                <br>
              </blockquote>
              <br>
              If I'm not completely mistaken the reset worker uses the
              suspend/resume functionality as well to get the hardware
              into a working state again.
              <br>
              <br>
              So if I'm not completely mistaken this here would lead to
              a deadlock, please double check that.
              <br>
            </blockquote>
            <br>
            We have tested many times, there are no deadlock.
            <br>
          </blockquote>
          <br>
          Testing doesn't tells you anything, you need to audit the call
          paths.
          <br>
          <br>
          <blockquote type="cite">In which situation, there would lead
            to a deadlock?
            <br>
          </blockquote>
          <br>
          GPU resets.
          <br>
        </blockquote>
        <br>
        Although
        flush_delayed_work(&rdev->fence_drv[i].lockup_work) will
        wait for a lockup_work to finish executing the last queueing, 
        but this kernel func haven't get any lock, and lockup_work will
        run in another kernel thread, so I think flush_delayed_work
        could not lead to a deadlock.
        <br>
        <br>
        Therefor if radeon_gpu_reset is called in another thread when
        radeon_suspend_kms is blocked on flush_delayed_work, there could
        not lead to a deadlock.
        <br>
      </blockquote>
      <br>
      Ok sounds like you didn't go what I wanted to say.
      <br>
      <br>
      The key problem is that radeon_gpu_reset() calls radeon_suspend()
      which in turn calls rdev->asic->suspend().
      <br>
      <br>
      And this function in turn could end up in radeon_suspend_kms()
      again, but I'm not 100% sure about that.
      <br>
      <br>
      Just double check the order of function called here (e.g. if
      radeon_suspend_kms() call radeon_suspend() or the other way
      around).  Apart from that your patch looks correct to me as well.
      <br>
      <br>
    </blockquote>
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
      margin-right:0px; -qt-block-indent:0; text-indent:0px;">radeon_gpu_reset
      will call radeon_suspend, but radeon_suspend will not call
      radeon_suspend_kms, the more detail of call flow, we can see the
      attachement file: radeon-suspend-reset.png.</p>
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px;
      margin-right:0px; -qt-block-indent:0; text-indent:0px;">Sorry  I
      may have mistaken your meaning.<br>
    </p>
    <p>
      <style type="text/css">p, li { white-space: pre-wra</style><br>
    </p>
    <blockquote type="cite"
      cite="mid:cc30a694-c784-f42c-bab6-b45c70202c56@amd.com">Regards,
      <br>
      Christian.
      <br>
      <br>
      <blockquote type="cite">
        <br>
        <blockquote type="cite">
          <br>
          Regards,
          <br>
          Christian.
          <br>
          <br>
          <blockquote type="cite">
            <br>
            <blockquote type="cite">
              <br>
              Regards,
              <br>
              Christian.
              <br>
              <br>
              <blockquote type="cite">Per PCI spec rev 4.0 on 5.3.1.4.1
                D3hot State.
                <br>
                <blockquote type="cite">Configuration and Message
                  requests are the only TLPs accepted by a Function in
                  <br>
                  the D3hot state. All other received Requests must be
                  handled as Unsupported Requests,
                  <br>
                  and all received Completions may optionally be handled
                  as Unexpected Completions.
                  <br>
                </blockquote>
                This issue will happen in following logs:
                <br>
                Unable to handle kernel paging request at virtual
                address 00008800e0008010
                <br>
                CPU 0 kworker/0:3(131): Oops 0
                <br>
                pc = [<ffffffff811bea5c>]  ra =
                [<ffffffff81240844>]  ps = 0000 Tainted: G       
                W
                <br>
                pc is at si_gpu_check_soft_reset+0x3c/0x240
                <br>
                ra is at si_dma_is_lockup+0x34/0xd0
                <br>
                v0 = 0000000000000000  t0 = fff08800e0008010  t1 =
                0000000000010000
                <br>
                t2 = 0000000000008010  t3 = fff00007e3c00000  t4 =
                fff00007e3c00258
                <br>
                t5 = 000000000000ffff  t6 = 0000000000000001  t7 =
                fff00007ef078000
                <br>
                s0 = fff00007e3c016e8  s1 = fff00007e3c00000  s2 =
                fff00007e3c00018
                <br>
                s3 = fff00007e3c00000  s4 = fff00007fff59d80  s5 =
                0000000000000000
                <br>
                s6 = fff00007ef07bd98
                <br>
                a0 = fff00007e3c00000  a1 = fff00007e3c016e8  a2 =
                0000000000000008
                <br>
                a3 = 0000000000000001  a4 = 8f5c28f5c28f5c29  a5 =
                ffffffff810f4338
                <br>
                t8 = 0000000000000275  t9 = ffffffff809b66f8  t10 =
                ff6769c5d964b800
                <br>
                t11= 000000000000b886  pv = ffffffff811bea20  at =
                0000000000000000
                <br>
                gp = ffffffff81d89690  sp = 00000000aa814126
                <br>
                Disabling lock debugging due to kernel taint
                <br>
                Trace:
                <br>
                [<ffffffff81240844>] si_dma_is_lockup+0x34/0xd0
                <br>
                [<ffffffff81119610>]
                radeon_fence_check_lockup+0xd0/0x290
                <br>
                [<ffffffff80977010>] process_one_work+0x280/0x550
                <br>
                [<ffffffff80977350>] worker_thread+0x70/0x7c0
                <br>
                [<ffffffff80977410>] worker_thread+0x130/0x7c0
                <br>
                [<ffffffff80982040>] kthread+0x200/0x210
                <br>
                [<ffffffff809772e0>] worker_thread+0x0/0x7c0
                <br>
                [<ffffffff80981f8c>] kthread+0x14c/0x210
                <br>
                [<ffffffff80911658>]
                ret_from_kernel_thread+0x18/0x20
                <br>
                [<ffffffff80981e40>] kthread+0x0/0x210
                <br>
                  Code: ad3e0008  43f0074a  ad7e0018  ad9e0020  8c3001e8
                40230101
                <br>
                  <88210000> 4821ed21
                <br>
                So force lockup work queue flush to fix this problem.
                <br>
                <br>
                Signed-off-by: Zhenneng Li <a class="moz-txt-link-rfc2396E" href="mailto:lizhenneng@kylinos.cn"><lizhenneng@kylinos.cn></a>
                <br>
                ---
                <br>
                  drivers/gpu/drm/radeon/radeon_device.c | 3 +++
                <br>
                  1 file changed, 3 insertions(+)
                <br>
                <br>
                diff --git a/drivers/gpu/drm/radeon/radeon_device.c
                b/drivers/gpu/drm/radeon/radeon_device.c
                <br>
                index 15692cb241fc..e608ca26780a 100644
                <br>
                --- a/drivers/gpu/drm/radeon/radeon_device.c
                <br>
                +++ b/drivers/gpu/drm/radeon/radeon_device.c
                <br>
                @@ -1604,6 +1604,9 @@ int radeon_suspend_kms(struct
                drm_device *dev, bool suspend,
                <br>
                          if (r) {
                <br>
                              /* delay GPU reset to resume */
                <br>
                              radeon_fence_driver_force_completion(rdev,
                i);
                <br>
                +        } else {
                <br>
                +            /* finish executing delayed work */
                <br>
                +
                flush_delayed_work(&rdev->fence_drv[i].lockup_work);
                <br>
                          }
                <br>
                      }
                <br>
              </blockquote>
              <br>
            </blockquote>
          </blockquote>
          <br>
        </blockquote>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>