<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 2021-04-21 3:22 a.m., Christian
König wrote:<br>
</div>
<blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">Am
20.04.21 um 22:21 schrieb Philip Yang:
<br>
<blockquote type="cite">Add interface to remove address from fault
filter ring by resetting
<br>
fault ring entry of the fault address timestamp to 0, then
future vm
<br>
fault on the address will be processed to recover.
<br>
<br>
Check fault address from fault ring, add address into fault ring
and
<br>
remove address from fault ring are serialized in same interrupt
deferred
<br>
work, don't have race condition.
<br>
</blockquote>
<br>
That might not work on Vega20.
<br>
<br>
We call amdgpu_gmc_filter_faults() from the the IH while the fault
handling id done from the delegated IH processing.
<br>
</blockquote>
Added spinlock for VG20.<br>
<blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">
<br>
More comments below.
<br>
<br>
<blockquote type="cite">Signed-off-by: Philip Yang
<a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
<br>
---
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 24
++++++++++++++++++++++++
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++
<br>
2 files changed, 26 insertions(+)
<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
<br>
index c39ed9eb0987..338e45fa66cb 100644
<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
<br>
@@ -387,6 +387,30 @@ bool amdgpu_gmc_filter_faults(struct
amdgpu_device *adev, uint64_t addr,
<br>
return false;
<br>
}
<br>
+/**
<br>
+ * amdgpu_gmc_filter_faults_remove - remove address from VM
faults filter
<br>
+ *
<br>
+ * @adev: amdgpu device structure
<br>
+ * @addr: address of the VM fault
<br>
+ * @pasid: PASID of the process causing the fault
<br>
+ *
<br>
+ * Remove the address from fault filter, then future vm fault
on this address
<br>
+ * will pass to retry fault handler to recover.
<br>
+ */
<br>
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
*adev, uint64_t addr,
<br>
+ uint16_t pasid)
<br>
+{
<br>
+ struct amdgpu_gmc *gmc = &adev->gmc;
<br>
+
<br>
+ uint64_t key = addr << 4 | pasid;
<br>
</blockquote>
<br>
We should probably have a function for this now.
<br>
</blockquote>
add function fault_key in v2.<br>
<blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">
<br>
<blockquote type="cite">+ struct amdgpu_gmc_fault *fault;
<br>
+ uint32_t hash;
<br>
+
<br>
+ hash = hash_64(key, AMDGPU_GMC_FAULT_HASH_ORDER);
<br>
+ fault =
&gmc->fault_ring[gmc->fault_hash[hash].idx];
<br>
+ fault->timestamp = 0;
<br>
</blockquote>
<br>
There is no guarantee that the ring entry you found for the fault
is the one for this address.
<br>
<br>
After all that is just an 8 bit hash for a 64bit values :)
<br>
<br>
You need to double check the key and walk the chain by looking at
the next entry to eventually find the right one.
<br>
<br>
</blockquote>
<p>I am not completely understand how fault->next and
gmc->last_fault works, as it keep increasing. Please help
review patch v2.</p>
<p>Thanks,</p>
<p>Philip<br>
</p>
<blockquote type="cite" cite="mid:6d4d7698-381a-f1d7-2eed-b71047ddc70d@gmail.com">Christian.
<br>
<br>
<blockquote type="cite">+}
<br>
+
<br>
int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
<br>
{
<br>
int r;
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
<br>
index 9d11c02a3938..498a7a0d5a9e 100644
<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
<br>
@@ -318,6 +318,8 @@ void amdgpu_gmc_agp_location(struct
amdgpu_device *adev,
<br>
struct amdgpu_gmc *mc);
<br>
bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
uint64_t addr,
<br>
uint16_t pasid, uint64_t timestamp);
<br>
+void amdgpu_gmc_filter_faults_remove(struct amdgpu_device
*adev, uint64_t addr,
<br>
+ uint16_t pasid);
<br>
int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
<br>
void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
<br>
int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device
*adev);
<br>
</blockquote>
<br>
</blockquote>
</body>
</html>