<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 3/3/2025 11:21 PM, Felix Kuehling
wrote:<br>
</div>
<blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com">
<pre wrap="" class="moz-quote-pre">
On 2025-01-31 11:58, Xiaogang.Chen wrote:
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">From: Xiaogang Chen <a class="moz-txt-link-rfc2396E" href="mailto:xiaogang.chen@amd.com"><xiaogang.chen@amd.com></a>
When register a vm range at svm the added vm range may be split into multiple
subranges and/or existing pranges got spitted. The new pranges need validated
and mapped. This patch changes error handling for pranges that fail updating:
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
It may help if you clearly state the problem you're trying to solve to justify the changes in error handling. See more comments inline.
</pre>
</blockquote>
<p>Current way is returning the last sub range error code if it got
issue during migration, validation or map. If the last error is <span style="white-space: pre-wrap">-EAGAIN, but there are other error </span>codes
at middle for other sub ranges we still return <span style="white-space: pre-wrap">-EAGAIN. That causes same procedure repeated until the sub ranges that have other error code becomes the last one. </span></p>
<p><span style="white-space: pre-wrap">I noticed it when looked at large range(more than 100GB) registration which split into multiple sub ranges. There were multiple </span><span style="white-space: pre-wrap">unnecessary </span><span style="white-space: pre-wrap"> repeats until hit return code that is no </span><span style="white-space: pre-wrap">-EAGAIN.</span></p>
<p><span style="white-space: pre-wrap">As you said we may return immediately if hit no </span><span style="white-space: pre-wrap">-EAGAIN, and hope app terminates. But if app does not terminate kfd drive will hold unused pranges until app stops.</span></p>
<blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com">
<pre wrap="" class="moz-quote-pre">
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">
1: free prange resources and remove it from svms if its updating fails as it
will not be used.
2: return -EAGAIN when all pranges at update_list need redo valid/map,
otherwise return no -EAGAIN error to user space to indicate failure. That
removes unnecessary retries.
Signed-off-by: Xiaogang Chen <a class="moz-txt-link-rfc2396E" href="mailto:xiaogang.chen@amd.com"><xiaogang.chen@amd.com></a>
---
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index e32e19196f6b..455cb98bf16a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3716,8 +3716,19 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
out_unlock_range:
mutex_unlock(&prange->migrate_mutex);
- if (r)
- ret = r;
+ /* this prange cannot be migraed, valid or map */
+ if (r) {
+ /* free this prange resources, remove it from svms */
+ svm_range_unlink(prange);
+ svm_range_remove_notifier(prange);
+ svm_range_free(prange, false);
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
Freeing the prange removes SVM mappings from the process. This will break the subsequent execution of the application. In case you were going to return -EAGAIN that's definitely wrong because the application would expect the SVM range to work after a successful retry.
</pre>
</blockquote>
When return <span style="white-space: pre-wrap">-EAGAIN</span> app
will do whole range registration again including rebuild sub ranges.
And at this stage we do not know if subsequent sub ranges will be
success or fail. So I release current sub range resource if it got
error(including <span style="white-space: pre-wrap">-EAGAIN)</span>.
After processing all sub ranges if decide to have app do it again,
the redo procedure will rebuild the released sub ranges.<br>
<blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com">
<pre wrap="" class="moz-quote-pre">
I'm not sure the resource waste is a valid argument in case of a fatal error. I would expect the application to terminate anyways in this case, which would result in freeing the resources. Do you see a scenario where an application wants to continue running after this function returned a fatal error?
</pre>
</blockquote>
I made a test app to check the behavior of registration of large
range for debugging a real issue. I do not know if real app will
continue to run when hit no <span style="white-space: pre-wrap">-EAGAIN</span>
error code. The purpose here is making driver handle this case more
general.<br>
<blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com">
<pre wrap="" class="moz-quote-pre">
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">+
+ /* ret got update when any r != -EAGAIN;
+ * return -EAGAIN when all pranges at update_list
+ * need redo valid/map */
+ if (r != -EAGAIN || !ret)
+ ret = r;
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
This is a good point. But the explanation is a bit misleading: "all pranges ... need redo" is not really true. There may also be ranges that were validated successfully. I think the point you're trying to make is this: Don't return -EAGAIN if there was any previous fatal error found.
</pre>
</blockquote>
ok<br>
<blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com">
<pre wrap="" class="moz-quote-pre">
I could potentially see a different optimization. If you encounter a fatal error, you can skip the rest of the ranges and return the error immediately.
</pre>
</blockquote>
<p>As said above it is a another way to return immediately if hit no
<span style="white-space: pre-wrap">-EAGAIN</span>. but should
kfd driver release unused pragne resources any way?</p>
<p>Regards</p>
<p>Xiaogang<br>
</p>
<blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com">
<pre wrap="" class="moz-quote-pre">
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">+ }
}
list_for_each_entry(prange, &remap_list, update_list) {
@@ -3729,8 +3740,16 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
if (r)
pr_debug("failed %d on remap svm range\n", r);
mutex_unlock(&prange->migrate_mutex);
- if (r)
- ret = r;
+
+ if (r) {
+ /* remove this prange */
+ svm_range_unlink(prange);
+ svm_range_remove_notifier(prange);
+ svm_range_free(prange, false);
</pre>
</blockquote>
<pre wrap="" class="moz-quote-pre">
Same as above.
Regards,
Felix
</pre>
<blockquote type="cite">
<pre wrap="" class="moz-quote-pre">+
+ if (r != -EAGAIN || !ret)
+ ret = r;
+ }
}
dynamic_svm_range_dump(svms);
</pre>
</blockquote>
</blockquote>
</body>
</html>