<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Thanks for catch up and fix this race condition. It looks good to
me. One minor thing below:<br>
</p>
<div class="moz-cite-prefix">On 3/6/2025 12:03 AM, Emily Deng wrote:<br>
</div>
<blockquote type="cite" cite="mid:20250306060309.3222693-1-Emily.Deng@amd.com">
<pre wrap="" class="moz-quote-pre">Issue:
In the scenario where svm_range_restore_pages is called, but svm->checkpoint_ts
has not been set and the retry fault has not been drained, svm_range_unmap_from_cpu
is triggered and calls svm_range_free. Meanwhile, svm_range_restore_pages
continues execution and reaches svm_range_from_addr. This results in
a "failed to find prange..." error, causing the page recovery to fail.
How to fix:
Move the timestamp check code under the protection of svm->lock.
v2:
Make sure all right locks are released before go out.
v3:
Directly goto out_unlock_svms, and return -EAGAIN.
v4:
Refine code.
Signed-off-by: Emily Deng <a class="moz-txt-link-rfc2396E" href="mailto:Emily.Deng@amd.com"><Emily.Deng@amd.com></a>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 30 +++++++++++++++-------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index d04725583f19..83ac14bf7a7a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3008,19 +3008,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
goto out;
}
- /* check if this page fault time stamp is before svms->checkpoint_ts */
- if (svms->checkpoint_ts[gpuidx] != 0) {
- if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
- pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
- r = 0;
- goto out;
- } else
- /* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
- * to zero to avoid following ts wrap around give wrong comparing
- */
- svms->checkpoint_ts[gpuidx] = 0;
- }
-
if (!p->xnack_enabled) {
pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
r = -EFAULT;
@@ -3040,6 +3027,20 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
mmap_read_lock(mm);
retry_write_locked:
mutex_lock(&svms->lock);
+
+ /* check if this page fault time stamp is before svms->checkpoint_ts */
+ if (svms->checkpoint_ts[gpuidx] != 0) {
+ if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
+ pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
+ r = -EAGAIN;</pre>
</blockquote>
<p>We drop page fault because it is stale, not mean to handle it
again. if return <span style="white-space: pre-wrap">-EAGAIN</span>
we do amdgpu_gmc_filter_faults_remove. If after unmap, user map
same range again we should treat page fault happened at same range
as new one. </p>
<p>Regards</p>
<p>Xiaogang<br>
</p>
<blockquote type="cite" cite="mid:20250306060309.3222693-1-Emily.Deng@amd.com">
<pre wrap="" class="moz-quote-pre">
+ goto out_unlock_svms;
+ } else
+ /* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
+ * to zero to avoid following ts wrap around give wrong comparing
+ */
+ svms->checkpoint_ts[gpuidx] = 0;
+ }
+
prange = svm_range_from_addr(svms, addr, NULL);
if (!prange) {
pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",
@@ -3165,7 +3166,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
mutex_unlock(&svms->lock);
mmap_read_unlock(mm);
- svm_range_count_fault(node, p, gpuidx);
+ if (r != -EAGAIN)
+ svm_range_count_fault(node, p, gpuidx);
mmput(mm);
out:
</pre>
</blockquote>
</body>
</html>