<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2021-04-21 3:22 a.m., Christian
      König wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">Am
      20.04.21 um 22:21 schrieb Philip Yang:
      <br>
      <blockquote type="cite">Add interface to remove address from fault
        filter ring by resetting
        <br>
        fault ring entry of the fault address timestamp to 0, then
        future vm
        <br>
        fault on the address will be processed to recover.
        <br>
        <br>
        Check fault address from fault ring, add address into fault ring
        and
        <br>
        remove address from fault ring are serialized in same interrupt
        deferred
        <br>
        work, don't have race condition.
        <br>
      </blockquote>
      <br>
      That might not work on Vega20.
      <br>
      <br>
      We call amdgpu_gmc_filter_faults() from the the IH while the fault
      handling id done from the delegated IH processing.
      <br>
    </blockquote>
    Added spinlock for VG20.<br>
    <blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">
      <br>
      More comments below.
      <br>
      <br>
      <blockquote type="cite">Signed-off-by: Philip Yang
        <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
        <br>
        ---
        <br>
          drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24
        ++++++++++++++++++++++++
        <br>
          drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  2 ++
        <br>
          2 files changed, 26 insertions(+)
        <br>
        <br>
        diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
        b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
        <br>
        index c39ed9eb0987..338e45fa66cb 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
        <br>
        +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
        <br>
        @@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct
        amdgpu_device *adev, uint64_t addr,
        <br>
              return false;
        <br>
          }
        <br>
          +/**
        <br>
        + * amdgpu_gmc_filter_faults_remove - remove address from VM
        faults filter
        <br>
        + *
        <br>
        + * @adev: amdgpu device structure
        <br>
        + * @addr: address of the VM fault
        <br>
        + * @pasid: PASID of the process causing the fault
        <br>
        + *
        <br>
        + * Remove the address from fault filter, then future vm fault
        on this address
        <br>
        + * will pass to retry fault handler to recover.
        <br>
        + */
        <br>
        +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
        *adev, uint64_t addr,
        <br>
        +                     uint16_t pasid)
        <br>
        +{
        <br>
        +    struct amdgpu_gmc *gmc = &adev->gmc;
        <br>
        +
        <br>
        +    uint64_t key = addr << 4 | pasid;
        <br>
      </blockquote>
      <br>
      We should probably have a function for this now.
      <br>
    </blockquote>
    add function fault_key in v2.<br>
    <blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">
      <br>
      <blockquote type="cite">+    struct amdgpu_gmc_fault *fault;
        <br>
        +    uint32_t hash;
        <br>
        +
        <br>
        +    hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
        <br>
        +    fault =
        &gmc->fault_ring[gmc->fault_hash[hash].idx];
        <br>
        +    fault->timestamp = 0;
        <br>
      </blockquote>
      <br>
      There is no guarantee that the ring entry you found for the fault
      is the one for this address.
      <br>
      <br>
      After all that is just an 8 bit hash for a 64bit values :)
      <br>
      <br>
      You need to double check the key and walk the chain by looking at
      the next entry to eventually find the right one.
      <br>
      <br>
    </blockquote>
    <p>I am not completely understand how fault->next and
      gmc->last_fault works, as it keep increasing. Please help
      review patch v2.</p>
    <p>Thanks,</p>
    <p>Philip<br>
    </p>
    <blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">Christian.
      <br>
      <br>
      <blockquote type="cite">+}
        <br>
        +
        <br>
          int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
        <br>
          {
        <br>
              int r;
        <br>
        diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
        b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
        <br>
        index 9d11c02a3938..498a7a0d5a9e 100644
        <br>
        --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
        <br>
        +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
        <br>
        @@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct
        amdgpu_device *adev,
        <br>
                           struct amdgpu_gmc *mc);
        <br>
          bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
        uint64_t addr,
        <br>
                            uint16_t pasid, uint64_t timestamp);
        <br>
        +void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
        *adev, uint64_t addr,
        <br>
        +                     uint16_t pasid);
        <br>
          int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
        <br>
          void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
        <br>
          int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device
        *adev);
        <br>
      </blockquote>
      <br>
    </blockquote>
  </body>
</html>