[PATCH] drm/amdkfd: Change error handling at prange update in svm_range_set_attr

Tue Mar 4 18:23:11 UTC 2025

On 3/3/2025 11:21 PM, Felix Kuehling wrote:
> On 2025-01-31 11:58, Xiaogang.Chen wrote:
>> From: Xiaogang Chen<xiaogang.chen at amd.com>
>>
>> When register a vm range at svm the added vm range may be split into multiple
>> subranges and/or existing pranges got spitted. The new pranges need validated
>> and mapped. This patch changes error handling for pranges that fail updating:
> It may help if you clearly state the problem you're trying to solve to justify the changes in error handling. See more comments inline.
>
Current way is returning the last sub range error code if it got issue 
during migration, validation or map. If the last error is -EAGAIN, but 
there are other error codes at middle for other sub ranges we still 
return -EAGAIN. That causes same procedure repeated until the sub ranges 
that have other error code becomes the last one.

I noticed it when looked at large range(more than 100GB) registration 
which split into multiple sub ranges. There were multiple unnecessary 
repeats until hit return code that is no -EAGAIN.

As you said we may return immediately if hit no -EAGAIN, and hope app 
terminates. But if app does not terminate kfd drive will hold unused 
pranges until app stops.

>> 1: free prange resources and remove it from svms if its updating fails as it
>> will not be used.
>> 2: return -EAGAIN when all pranges at update_list need redo valid/map,
>> otherwise return no -EAGAIN error to user space to indicate failure. That
>> removes unnecessary retries.
>>
>> Signed-off-by: Xiaogang Chen<xiaogang.chen at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++----
>>   1 file changed, 23 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index e32e19196f6b..455cb98bf16a 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -3716,8 +3716,19 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>>   
>>   out_unlock_range:
>>   		mutex_unlock(&prange->migrate_mutex);
>> -		if (r)
>> -			ret = r;
>> +		/* this prange cannot be migraed, valid or map */
>> +		if (r) {
>> +			/* free this prange resources, remove it from svms */
>> +			svm_range_unlink(prange);
>> +			svm_range_remove_notifier(prange);
>> +			svm_range_free(prange, false);
> Freeing the prange removes SVM mappings from the process. This will break the subsequent execution of the application. In case you were going to return -EAGAIN that's definitely wrong because the application would expect the SVM range to work after a successful retry.
When return -EAGAIN app will do whole range registration again including 
rebuild sub ranges. And at this stage we do not know if subsequent sub 
ranges will be success or fail. So I release current sub range resource 
if it got error(including -EAGAIN). After processing all sub ranges if 
decide to have app do it again, the redo procedure will rebuild the 
released sub ranges.
> I'm not sure the resource waste is a valid argument in case of a fatal error. I would expect the application to terminate anyways in this case, which would result in freeing the resources. Do you see a scenario where an application wants to continue running after this function returned a fatal error?
I made a test app to check the behavior of registration of large range 
for debugging a real issue. I do not know if real app will continue to 
run when hit no -EAGAIN error code. The purpose here is making driver 
handle this case more general.
>
>> +
>> +			/* ret got update when any r != -EAGAIN;
>> +			 * return -EAGAIN when all pranges at update_list
>> +			 * need redo valid/map */
>> +			if (r != -EAGAIN || !ret)
>> +				ret = r;
> This is a good point. But the explanation is a bit misleading: "all pranges ... need redo" is not really true. There may also be ranges that were validated successfully. I think the point you're trying to make is this: Don't return -EAGAIN if there was any previous fatal error found.
ok
> I could potentially see a different optimization. If you encounter a fatal error, you can skip the rest of the ranges and return the error immediately.
>
As said above it is a another way to return immediately if hit no 
-EAGAIN.  but should kfd driver release unused pragne resources any way?

Regards

Xiaogang

>> +		}
>>   	}
>>   
>>   	list_for_each_entry(prange, &remap_list, update_list) {
>> @@ -3729,8 +3740,16 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm,
>>   		if (r)
>>   			pr_debug("failed %d on remap svm range\n", r);
>>   		mutex_unlock(&prange->migrate_mutex);
>> -		if (r)
>> -			ret = r;
>> +
>> +		if (r) {
>> +			/* remove this prange */
>> +			svm_range_unlink(prange);
>> +			svm_range_remove_notifier(prange);
>> +			svm_range_free(prange, false);
> Same as above.
>
> Regards,
>    Felix
>
>
>> +
>> +			if (r != -EAGAIN || !ret)
>> +				ret = r;
>> +		}
>>   	}
>>   
>>   	dynamic_svm_range_dump(svms);
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20250304/3b514601/attachment-0001.htm>