<!DOCTYPE html><html><head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body> <div class="moz-cite-prefix">On 3/3/2025 11:21 PM, Felix Kuehling wrote: </div> <blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com"> <pre wrap="" class="moz-quote-pre"> On 2025-01-31 11:58, Xiaogang.Chen wrote: </pre> <blockquote type="cite"> <pre wrap="" class="moz-quote-pre">From: Xiaogang Chen <a class="moz-txt-link-rfc2396E" href="mailto:xiaogang.chen@amd.com"><xiaogang.chen@amd.com></a> When register a vm range at svm the added vm range may be split into multiple subranges and/or existing pranges got spitted. The new pranges need validated and mapped. This patch changes error handling for pranges that fail updating: </pre> </blockquote> <pre wrap="" class="moz-quote-pre"> It may help if you clearly state the problem you're trying to solve to justify the changes in error handling. See more comments inline. </pre> </blockquote> Current way is returning the last sub range error code if it got issue during migration, validation or map. If the last error is -EAGAIN, but there are other error codes at middle for other sub ranges we still return -EAGAIN. That causes same procedure repeated until the sub ranges that have other error code becomes the last one. I noticed it when looked at large range(more than 100GB) registration which split into multiple sub ranges. There were multiple unnecessary repeats until hit return code that is no -EAGAIN. As you said we may return immediately if hit no -EAGAIN, and hope app terminates. But if app does not terminate kfd drive will hold unused pranges until app stops. <blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com"> <pre wrap="" class="moz-quote-pre"> </pre> <blockquote type="cite"> <pre wrap="" class="moz-quote-pre"> 1: free prange resources and remove it from svms if its updating fails as it will not be used. 2: return -EAGAIN when all pranges at update_list need redo valid/map, otherwise return no -EAGAIN error to user space to indicate failure. That removes unnecessary retries. Signed-off-by: Xiaogang Chen <a class="moz-txt-link-rfc2396E" href="mailto:xiaogang.chen@amd.com"><xiaogang.chen@amd.com></a> --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index e32e19196f6b..455cb98bf16a 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -3716,8 +3716,19 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm, out_unlock_range: mutex_unlock(&prange->migrate_mutex); - if (r) - ret = r; + /* this prange cannot be migraed, valid or map */ + if (r) { + /* free this prange resources, remove it from svms */ + svm_range_unlink(prange); + svm_range_remove_notifier(prange); + svm_range_free(prange, false); </pre> </blockquote> <pre wrap="" class="moz-quote-pre"> Freeing the prange removes SVM mappings from the process. This will break the subsequent execution of the application. In case you were going to return -EAGAIN that's definitely wrong because the application would expect the SVM range to work after a successful retry. </pre> </blockquote> When return -EAGAIN app will do whole range registration again including rebuild sub ranges. And at this stage we do not know if subsequent sub ranges will be success or fail. So I release current sub range resource if it got error(including -EAGAIN). After processing all sub ranges if decide to have app do it again, the redo procedure will rebuild the released sub ranges. <blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com"> <pre wrap="" class="moz-quote-pre"> I'm not sure the resource waste is a valid argument in case of a fatal error. I would expect the application to terminate anyways in this case, which would result in freeing the resources. Do you see a scenario where an application wants to continue running after this function returned a fatal error? </pre> </blockquote> I made a test app to check the behavior of registration of large range for debugging a real issue. I do not know if real app will continue to run when hit no -EAGAIN error code. The purpose here is making driver handle this case more general. <blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com"> <pre wrap="" class="moz-quote-pre"> </pre> <blockquote type="cite"> <pre wrap="" class="moz-quote-pre">+ + /* ret got update when any r != -EAGAIN; + * return -EAGAIN when all pranges at update_list + * need redo valid/map */ + if (r != -EAGAIN || !ret) + ret = r; </pre> </blockquote> <pre wrap="" class="moz-quote-pre"> This is a good point. But the explanation is a bit misleading: "all pranges ... need redo" is not really true. There may also be ranges that were validated successfully. I think the point you're trying to make is this: Don't return -EAGAIN if there was any previous fatal error found. </pre> </blockquote> ok <blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com"> <pre wrap="" class="moz-quote-pre"> I could potentially see a different optimization. If you encounter a fatal error, you can skip the rest of the ranges and return the error immediately. </pre> </blockquote> As said above it is a another way to return immediately if hit no -EAGAIN. but should kfd driver release unused pragne resources any way? Regards Xiaogang <blockquote type="cite" cite="mid:af21821a-b22c-40e4-9f17-2a15d4813cd7@amd.com"> <pre wrap="" class="moz-quote-pre"> </pre> <blockquote type="cite"> <pre wrap="" class="moz-quote-pre">+ } } list_for_each_entry(prange, &remap_list, update_list) { @@ -3729,8 +3740,16 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm, if (r) pr_debug("failed %d on remap svm range\n", r); mutex_unlock(&prange->migrate_mutex); - if (r) - ret = r; + + if (r) { + /* remove this prange */ + svm_range_unlink(prange); + svm_range_remove_notifier(prange); + svm_range_free(prange, false); </pre> </blockquote> <pre wrap="" class="moz-quote-pre"> Same as above. Regards, Felix </pre> <blockquote type="cite"> <pre wrap="" class="moz-quote-pre">+ + if (r != -EAGAIN || !ret) + ret = r; + } } dynamic_svm_range_dump(svms); </pre> </blockquote> </blockquote> </body> </html>