<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2021-04-21 11:29 a.m., Felix
      Kuehling wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:9ea89549-afb4-ed48-4ff2-0ae740251d59@amd.com">On
      2021-04-21 3:55 a.m., Christian König wrote:
      <br>
      <blockquote type="cite">Am 21.04.21 um 03:20 schrieb Felix
        Kuehling:
        <br>
        <blockquote type="cite">Am 2021-04-20 um 4:21 p.m. schrieb
          Philip Yang:
          <br>
          <blockquote type="cite">Add interface to remove address from
            fault filter ring by resetting
            <br>
            fault ring entry of the fault address timestamp to 0, then
            future vm
            <br>
            fault on the address will be processed to recover.
            <br>
            <br>
            Check fault address from fault ring, add address into fault
            ring and
            <br>
            remove address from fault ring are serialized in same
            interrupt deferred
            <br>
            work, don't have race condition.
            <br>
            <br>
            Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
            <br>
            ---
            <br>
              drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24
            ++++++++++++++++++++++++
            <br>
              drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++
            <br>
              2 files changed, 26 insertions(+)
            <br>
            <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
            <br>
            index c39ed9eb0987..338e45fa66cb 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
            <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
            <br>
            @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct
            amdgpu_device *adev, uint64_t addr,
            <br>
                  return false;
            <br>
              }
            <br>
              +/**
            <br>
            + * amdgpu_gmc_filter_faults_remove - remove address from VM
            faults filter
            <br>
            + *
            <br>
            + * @adev: amdgpu device structure
            <br>
            + * @addr: address of the VM fault
            <br>
            + * @pasid: PASID of the process causing the fault
            <br>
            + *
            <br>
            + * Remove the address from fault filter, then future vm
            fault on this address
            <br>
            + * will pass to retry fault handler to recover.
            <br>
            + */
            <br>
            +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
            *adev, uint64_t addr,
            <br>
            +                     uint16_t pasid)
            <br>
            +{
            <br>
            +    struct amdgpu_gmc *gmc = &adev->gmc;
            <br>
            +
            <br>
            +    uint64_t key = addr << 4 | pasid;
            <br>
            +    struct amdgpu_gmc_fault *fault;
            <br>
            +    uint32_t hash;
            <br>
            +
            <br>
            +    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
            <br>
            +    fault =
            &gmc->fault_ring[gmc->fault_hash[hash].idx];
            <br>
          </blockquote>
          You need to loop over the fault ring to find a fault with the
          matching
          <br>
          key since there may be hash collisions.
          <br>
          <br>
          You also need to make sure you don't break the single link
          list of keys
          <br>
          with the same hash when you remove an entry. I think the
          easier way to
          <br>
          remove an entry without breaking this ring+closed hashing
          structure is
          <br>
          to reset the fault->key rather than the
          fault->timestamp.
          <br>
          <br>
          Finally, you need to add locking to the fault ring structure.
          Currently
          <br>
          it's not protected by any locks because only one thread (the
          interrupt
          <br>
          handler) accesses it. Now you have another thread that can
          remove
          <br>
          entries, so you need to protect it with a lock. If you are
          handling
          <br>
          retry faults, you know that the interrupt handler is really a
          worker
          <br>
          thread, so you can use a mutex or a spin-lock, but it doesn't
          need to be
          <br>
          interrupt-safe.
          <br>
        </blockquote>
        <br>
        I don't think you need a lock at all.
        <br>
        <br>
        Just using cmpxchg() to update the key should do it.
        <br>
        <br>
        Something like this:
        <br>
        <br>
                hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
        <br>
                fault =
        &gmc->fault_ring[gmc->fault_hash[hash].idx];
        <br>
                while (fault->timestamp >= stamp) {
        <br>
                        uint64_t tmp;
        <br>
        <br>
                        cmpxchg(&fault->key, key, 0);
        <br>
      </blockquote>
      <br>
      Good idea. Then we should probably use READ_ONCE and WRITE_ONCE to
      access fault->key in other places.
      <br>
    </blockquote>
    <p>I will use spinlock to protect fault ring and fault hash table
      access, as atomic_cmpxchg cannot work for 52bit key and VG20 need
      access fault ring in interrupt handler and deferred work.</p>
    <p>Regards,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:9ea89549-afb4-ed48-4ff2-0ae740251d59@amd.com">
      <br>
      Regards,
      <br>
        Felix
      <br>
      <br>
      <br>
      <blockquote type="cite">
        <br>
                        tmp = fault->timestamp;
        <br>
                        fault = &gmc->fault_ring[fault->next];
        <br>
        <br>
                        /* Check if the entry was reused */
        <br>
                        if (fault->timestamp >= tmp)
        <br>
                                break;
        <br>
                }
        <br>
        <br>
        Regards,
        <br>
        Christian.
        <br>
        <br>
        <blockquote type="cite">
          <br>
          Regards,
          <br>
             Felix
          <br>
          <br>
          <br>
          <blockquote type="cite">+    fault->timestamp = 0;
            <br>
            +}
            <br>
            +
            <br>
              int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
            <br>
              {
            <br>
                  int r;
            <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
            <br>
            index 9d11c02a3938..498a7a0d5a9e 100644
            <br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
            <br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
            <br>
            @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct
            amdgpu_device *adev,
            <br>
                               struct amdgpu_gmc *mc);
            <br>
              bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
            uint64_t addr,
            <br>
                                uint16_t pasid, uint64_t timestamp);
            <br>
            +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
            *adev, uint64_t addr,
            <br>
            +                     uint16_t pasid);
            <br>
              int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
            <br>
              void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
            <br>
              int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device
            *adev);
            <br>
          </blockquote>
          _______________________________________________
          <br>
          amd-gfx mailing list
          <br>
          <a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a>
          <br>
<a class="moz-txt-link-freetext" href="https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cfelix.kuehling%40amd.com%7C7d19870014ff40b6d80b08d9049ae8a0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637545885642357593%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=b0LnO9SQPgYskIRH0vCjKebvh%2FYzFfidRne15q2WIXw%3D&amp;reserved=0">https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=04%7C01%7Cfelix.kuehling%40amd.com%7C7d19870014ff40b6d80b08d9049ae8a0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637545885642357593%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=b0LnO9SQPgYskIRH0vCjKebvh%2FYzFfidRne15q2WIXw%3D&amp;reserved=0</a>
          <br>
        </blockquote>
        <br>
      </blockquote>
    </blockquote>
  </body>
</html>