<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div class="moz-cite-prefix">On 2024-05-02 08:42, James Zhu wrote:<br>
</div>
<blockquote type="cite" cite="mid:a28dddff-5eae-4856-8f87-26d534163a6d@amd.com">
<br>
On 2024-05-01 18:56, Philip Yang wrote:
<br>
<blockquote type="cite">On systems with khugepaged enabled, and in use
cases with THP buffers, hmm_range_fault may take more than 15 seconds
to return -EBUSY; the arbitrary timeout value is not accurate and
causes memory allocation failures.
<br>
<br>
Remove the arbitrary timeout value and return EAGAIN to the
application if hmm_range_fault returns EBUSY; userspace libdrm and
the Thunk will then call the ioctl again.
<br>
<br>
Demote the EAGAIN message to a debug message, as this is not an
error.
<br>
<br>
Signed-off-by: Philip Yang <a class="moz-txt-link-rfc2396E" href="mailto:Philip.Yang@amd.com"><Philip.Yang@amd.com></a>
<br>
---
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 5 ++++-
<br>
drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c | 12
+++---------
<br>
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +----
<br>
3 files changed, 8 insertions(+), 14 deletions(-)
<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
<br>
index 54198c3928c7..02696c2102f1 100644
<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
<br>
@@ -1087,7 +1087,10 @@ static int init_user_pages(struct kgd_mem
*mem, uint64_t user_addr,
<br>
ret = amdgpu_ttm_tt_get_user_pages(bo,
bo->tbo.ttm->pages, &range);
<br>
if (ret) {
<br>
- pr_err("%s: Failed to get user pages: %d\n", __func__,
ret);
<br>
+ if (ret == -EAGAIN)
<br>
+ pr_debug("Failed to get user pages, try again\n");
<br>
+ else
<br>
+ pr_err("%s: Failed to get user pages: %d\n",
__func__, ret);
<br>
goto unregister_out;
<br>
}
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
<br>
index 431ec72655ec..e36fede7f74c 100644
<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
<br>
@@ -202,20 +202,12 @@ int amdgpu_hmm_range_get_pages(struct
mmu_interval_notifier *notifier,
<br>
pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
<br>
hmm_range->start, hmm_range->end);
<br>
- /* Assuming 64MB takes maximum 1 second to fault page
address */
<br>
- timeout = max((hmm_range->end - hmm_range->start)
>> 26, 1UL);
<br>
- timeout *= HMM_RANGE_DEFAULT_TIMEOUT;
<br>
- timeout = jiffies + msecs_to_jiffies(timeout);
<br>
+ timeout = jiffies +
msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
<br>
</blockquote>
[JZ] should we reduce MAX_WALK_BYTE to 64M in the meantime?
<br>
</blockquote>
From the debug log, the range size is not the factor; a 64MB range
may take just as long to return EBUSY.<br>
<blockquote type="cite" cite="mid:a28dddff-5eae-4856-8f87-26d534163a6d@amd.com">
<blockquote type="cite"> retry:
<br>
hmm_range->notifier_seq =
mmu_interval_read_begin(notifier);
<br>
r = hmm_range_fault(hmm_range);
<br>
if (unlikely(r)) {
<br>
- schedule();
<br>
</blockquote>
[JZ] the above is a workaround for a CPU stall; we may still need to keep it.
<br>
</blockquote>
<p>The 1 second timeout should be long enough for the normal case. If
hmm_range_fault returns EBUSY, we release the mmap_read lock and
return to user space, so the explicit schedule() is not needed to
avoid the CPU stall warning. I will run KFDTest LargestSysBufferTest
overnight on a larger-memory system to confirm that no CPU stall
message appears.</p>
<p>Regards,</p>
<p>Philip<br>
</p>
<blockquote type="cite" cite="mid:a28dddff-5eae-4856-8f87-26d534163a6d@amd.com">
<blockquote type="cite">- /*
<br>
- * FIXME: This timeout should encompass the retry
from
<br>
- * mmu_interval_read_retry() as well.
<br>
- */
<br>
if (r == -EBUSY && !time_after(jiffies,
timeout))
<br>
goto retry;
<br>
goto out_free_pfns;
<br>
@@ -247,6 +239,8 @@ int amdgpu_hmm_range_get_pages(struct
mmu_interval_notifier *notifier,
<br>
out_free_range:
<br>
kfree(hmm_range);
<br>
+ if (r == -EBUSY)
<br>
+ r = -EAGAIN;
<br>
return r;
<br>
}
<br>
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
<br>
index 94f83be2232d..e7040f809f33 100644
<br>
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
<br>
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
<br>
@@ -1670,11 +1670,8 @@ static int
svm_range_validate_and_map(struct mm_struct *mm,
<br>
readonly, owner, NULL,
<br>
&hmm_range);
<br>
WRITE_ONCE(p->svms.faulting_task, NULL);
<br>
- if (r) {
<br>
+ if (r)
<br>
pr_debug("failed %d to get svm range pages\n",
r);
<br>
- if (r == -EBUSY)
<br>
- r = -EAGAIN;
<br>
- }
<br>
} else {
<br>
r = -EFAULT;
<br>
}
<br>
</blockquote>
</blockquote>
</body>
</html>