[PATCH] drm/amdkfd: handle errors from svm validate and map

Felix Kuehling felix.kuehling at amd.com
Wed Sep 13 18:27:09 UTC 2023


On 2023-09-13 13:33, Philip Yang wrote:
>
>
> On 2023-09-13 12:14, Felix Kuehling wrote:
>> On 2023-09-13 11:16, Philip Yang wrote:
>>> If new range is added to update list, splited to multiple pranges with
>>> max_svm_range_pages alignment, and svm validate and map returns error
>>> for the first prange, then the caller retry should add pranges with
>>> prange->is_error_flag or prange without prange->mapped_to_gpu to the
>>> update list, to update GPU mapping for the entire range.
>>
>> It looks like the only new thing here is to remove the "same 
>> attribute" optimization for ranges that are not mapped on the GPU. I 
>> don't fully understand the scenario you're describing here, but it 
>> feels like this change has a bigger impact than it needs to have. 
>> Your description specifically talks about ranges split at 
>> max_svm_range_pages boundaries. But your patch affects all ranges not 
>> mapped on the GPU, even it prange->is_error_flag is not set.
>>
>> Maybe that's OK, because the expensive thing is updating existing 
>> mappings unnecessarily. If there is no existing mapping yet, it's 
>> probably not a big deal. I just don't understand the scenario that 
>> requires a retry  without the prange->is_error_flag being set. Maybe 
>> a better fix would be to ensure that prange->is_error_flag gets set 
>> in your scenario.
>
> Another way to fix the issue, set_attr continue to call 
> svm_range_validate_and_map for all pranges of update_list after 
> svm_range_validate_and_map return error, this change will have less 
> side-effect.
>
Another option is to get rid of the is_error_flag completely. Instead 
only use mapped_to_gpu and set it to false if validate_and_map fails.

Regards,
   Felix


> Regards,
>
> Philip
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Fixes: c22b04407097 ("drm/amdkfd: flag added to handle errors from 
>>> svm validate and map")
>>> Signed-off-by: Philip Yang <Philip.Yang at amd.com>
>>> Tested-by: James Zhu <james.zhu at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> index 61dd66bddc3c..8871329e9cbd 100644
>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>>> @@ -825,7 +825,7 @@ svm_range_is_same_attrs(struct kfd_process *p, 
>>> struct svm_range *prange,
>>>           }
>>>       }
>>>   -    return !prange->is_error_flag;
>>> +    return true;
>>>   }
>>>     /**
>>> @@ -2228,7 +2228,8 @@ svm_range_add(struct kfd_process *p, uint64_t 
>>> start, uint64_t size,
>>>           next = interval_tree_iter_next(node, start, last);
>>>           next_start = min(node->last, last) + 1;
>>>   -        if (svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>>> +        if (!prange->is_error_flag && prange->mapped_to_gpu &&
>>> +            svm_range_is_same_attrs(p, prange, nattr, attrs)) {
>>>               /* nothing to do */
>>>           } else if (node->start < start || node->last > last) {
>>>               /* node intersects the update range and its attributes


More information about the amd-gfx mailing list