<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2021-11-10 9:54 a.m., Christian
      König wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:4130bde8-8184-4aed-4634-7d58f3e61d2f@gmail.com">Am
      10.11.21 um 15:44 schrieb philip yang:
      <br>
      <blockquote type="cite">
        <br>
        On 2021-11-10 9:31 a.m., Christian König wrote:
        <br>
        <br>
        <blockquote type="cite">Am 10.11.21 um 14:59 schrieb philip
          yang:
          <br>
          <blockquote type="cite">
            <br>
            On 2021-11-10 5:15 a.m., Christian König wrote:
            <br>
            <br>
            <blockquote type="cite">[SNIP]
              <br>
            </blockquote>
            <br>
            It is hard to understand, this debug log can explain more
            details, with this debug message patch
            <br>
            <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
            <br>
            index ed6f8d24280b..8859f2bb11b1 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
            <br>
            @@ -234,10 +235,12 @@ int amdgpu_ih_process(struct
            amdgpu_device *adev, struct amdgpu_ih_ring *ih)
            <br>
                            return IRQ_NONE;
            <br>
            <br>
                    wptr = amdgpu_ih_get_wptr(adev, ih);
            <br>
            +       if (ih == &adev->irq.ih1)
            <br>
            +               pr_debug("entering rptr 0x%x, wptr 0x%x\n",
            ih->rptr, wptr);
            <br>
            <br>
             restart_ih:
            <br>
            +       if (ih == &adev->irq.ih1)
            <br>
            +               pr_debug("starting rptr 0x%x, wptr 0x%x\n",
            ih->rptr, wptr);
            <br>
            <br>
                    /* Order reading of wptr vs. reading of IH ring data
            */
            <br>
                    rmb();
            <br>
            @@ -245,8 +248,12 @@ int amdgpu_ih_process(struct
            amdgpu_device *adev, struct amdgpu_ih_ring *ih)
            <br>
                    while (ih->rptr != wptr && --count) {
            <br>
                            amdgpu_irq_dispatch(adev, ih);
            <br>
                            ih->rptr &= ih->ptr_mask;
            <br>
            +               if (ih == &adev->irq.ih1) {
            <br>
            +                       pr_debug("rptr 0x%x, old wptr 0x%x,
            new wptr 0x%x\n",
            <br>
            +                               ih->rptr, wptr,
            <br>
            +                               amdgpu_ih_get_wptr(adev,
            ih));
            <br>
            +               }
            <br>
                    }
            <br>
            <br>
                    amdgpu_ih_set_rptr(adev, ih);
            <br>
            @@ -257,6 +264,8 @@ int amdgpu_ih_process(struct
            amdgpu_device *adev, struct amdgpu_ih_ring *ih)
            <br>
                    if (wptr != ih->rptr)
            <br>
                            goto restart_ih;
            <br>
            <br>
            +       if (ih == &adev->irq.ih1)
            <br>
            +               pr_debug("exiting rptr 0x%x, wptr 0x%x\n",
            ih->rptr, wptr);
            <br>
                    return IRQ_HANDLED;
            <br>
             }
            <br>
            <br>
            This is log, timing 48.807028, ring1 drain is done, rptr ==
            wptr, ring1 is empty, but the loop continues, to handle
            outdated retry fault.
            <br>
            <br>
          </blockquote>
          <br>
          As far as I can see that is perfectly correct and expected
          behavior.
          <br>
          <br>
          See the ring buffer overflowed and because of that the loop
          continues, but that is correct because an overflow means that
          the ring was filled with new entries.
          <br>
          <br>
          So we are processing new entries here, not stale ones.
          <br>
        </blockquote>
        <br>
        wptr is 0x840, 0x860.....0xd20 is not new entries, it is stale
        retry fault, the loop will continue handle stale fault to 0xd20
        and then exit loop, it is because wptr pass rptr after timing
        48.806122.
        <br>
        <br>
      </blockquote>
      <br>
      Yeah, but 0x840..0xd20 still contain perfectly valid IVs which you
      drop on the ground with your approach. That's not something we can
      do.
      <br>
    </blockquote>
    <p>We drain retry fault in unmap from cpu notifier, drain finish
      condition is ring checkpoint is processed, or ring is empty
      rptr=wptr (to fix another issue, IH ring1 do not setup ring wptr
      overflow flag after wptr exceed rptr).</p>
    <p>After drain retry fault returns, we are not expecting retry fault
      on the range as queue should not access the range, so we free
      range as it is unmapped from cpu. From this point of view,
      0x860..0xd20 are stale retry fault.<br>
    </p>
    <blockquote type="cite" cite="mid:4130bde8-8184-4aed-4634-7d58f3e61d2f@gmail.com">
      <br>
      What can happen is that the ring buffer overflows and we process
      the same IV twice, but that is also perfectly expected behavior
      when an overflow happens.
      <br>
    </blockquote>
    <p>After wptr exceed rptr, because no overflow flag, we just drain
      the fault until rptr=wptr, yes, this drops many retry faults, it
      is ok as real retry fault will come in again.</p>
    <p>Regards,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:4130bde8-8184-4aed-4634-7d58f3e61d2f@gmail.com">
      <br>
      Regards,
      <br>
      Christian.
      <br>
      <br>
      <blockquote type="cite">Regards.
        <br>
        <br>
        Philip
        <br>
        <br>
        <blockquote type="cite">
          <br>
          Regards,
          <br>
          Christian.
          <br>
          <br>
          <blockquote type="cite">[   48.802231] amdgpu_ih_process:243:
            amdgpu: starting rptr 0x520, wptr 0xd20
            <br>
            [   48.802235] amdgpu_ih_process:254: amdgpu: rptr 0x540,
            old wptr 0xd20, new wptr 0xd20
            <br>
            [   48.802256] amdgpu_ih_process:254: amdgpu: rptr 0x560,
            old wptr 0xd20, new wptr 0xd20
            <br>
            [   48.802260] amdgpu_ih_process:254: amdgpu: rptr 0x580,
            old wptr 0xd20, new wptr 0xd20
            <br>
            [   48.802281] amdgpu_ih_process:254: amdgpu: rptr 0x5a0,
            old wptr 0xd20, new wptr 0xd20
            <br>
            [   48.802314] amdgpu_ih_process:254: amdgpu: rptr 0x5c0,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802335] amdgpu_ih_process:254: amdgpu: rptr 0x5e0,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802356] amdgpu_ih_process:254: amdgpu: rptr 0x600,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802376] amdgpu_ih_process:254: amdgpu: rptr 0x620,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802396] amdgpu_ih_process:254: amdgpu: rptr 0x640,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802401] amdgpu_ih_process:254: amdgpu: rptr 0x660,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802421] amdgpu_ih_process:254: amdgpu: rptr 0x680,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802442] amdgpu_ih_process:254: amdgpu: rptr 0x6a0,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802463] amdgpu_ih_process:254: amdgpu: rptr 0x6c0,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802483] amdgpu_ih_process:254: amdgpu: rptr 0x6e0,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802503] amdgpu_ih_process:254: amdgpu: rptr 0x700,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802523] amdgpu_ih_process:254: amdgpu: rptr 0x720,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802544] amdgpu_ih_process:254: amdgpu: rptr 0x740,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802565] amdgpu_ih_process:254: amdgpu: rptr 0x760,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.802569] amdgpu_ih_process:254: amdgpu: rptr 0x780,
            old wptr 0xd20, new wptr 0xce0
            <br>
            [   48.804392] amdgpu_ih_process:254: amdgpu: rptr 0x7a0,
            old wptr 0xd20, new wptr 0xf00
            <br>
            [   48.806122] amdgpu_ih_process:254: amdgpu: rptr 0x7c0,
            old wptr 0xd20, new wptr 0x840
            <br>
            [   48.806155] amdgpu_ih_process:254: amdgpu: rptr 0x7e0,
            old wptr 0xd20, new wptr 0x840
            <br>
            [   48.806965] amdgpu_ih_process:254: amdgpu: rptr 0x800,
            old wptr 0xd20, new wptr 0x840
            <br>
            [   48.806995] amdgpu_ih_process:254: amdgpu: rptr 0x820,
            old wptr 0xd20, new wptr 0x840
            <br>
            [   48.807028] amdgpu_ih_process:254: amdgpu: rptr 0x840,
            old wptr 0xd20, new wptr 0x840
            <br>
            [   48.807063] amdgpu_ih_process:254: amdgpu: rptr 0x860,
            old wptr 0xd20, new wptr 0x840
            <br>
            [   48.808421] amdgpu_ih_process:254: amdgpu: rptr 0x880,
            old wptr 0xd20, new wptr 0x840
            <br>
            <br>
            Cause this gpu vm fault dump because address is unmapped
            from cpu.
            <br>
            <br>
            [   48.807071] svm_range_restore_pages:2617: amdgpu:
            restoring svms 0x00000000733bf007 fault address 0x7f8a6991f
            <br>
            <br>
            [   48.807170] svm_range_restore_pages:2631: amdgpu: failed
            to find prange svms 0x00000000733bf007 address [0x7f8a6991f]
            <br>
            [   48.807179] svm_range_get_range_boundaries:2348: amdgpu:
            VMA does not exist in address [0x7f8a6991f]
            <br>
            [   48.807185] svm_range_restore_pages:2635: amdgpu: failed
            to create unregistered range svms 0x00000000733bf007 address
            [0x7f8a6991f]
            <br>
            <br>
            [   48.807929] amdgpu 0000:25:00.0: amdgpu: [mmhub0] retry
            page fault (src_id:0 ring:0 vmid:8 pasid:32770, for process
            kfdtest pid 3969 thread kfdtest pid 3969)
            <br>
            [   48.808219] amdgpu 0000:25:00.0: amdgpu:   in page
            starting at address 0x00007f8a6991f000 from IH client 0x12
            (VMC)
            <br>
            [   48.808230] amdgpu 0000:25:00.0: amdgpu:
            VM_L2_PROTECTION_FAULT_STATUS:0x00800031
            <br>
            <br>
            <blockquote type="cite">We could of course parameterize that
              so that we check the wptr after each IV on IH1, but please
              not hard coded like this.
              <br>
              <br>
              Regards,
              <br>
              Christian.
              <br>
              <br>
              <blockquote type="cite">      }
                <br>
                        amdgpu_ih_set_rptr(adev, ih);
                <br>
              </blockquote>
              <br>
            </blockquote>
          </blockquote>
          <br>
        </blockquote>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>