[PATCH] drm/amdkfd: fix some race conditions in vram buffer alloc/free of svm code

Chen, Xiaogang xiaogang.chen at amd.com
Wed Sep 20 15:58:04 UTC 2023


On 9/20/2023 9:55 AM, Felix Kuehling wrote:
>
> On 2023-09-20 2:17, Xiaogang.Chen wrote:
>> From: Xiaogang Chen <xiaogang.chen at amd.com>
>>
>> This patch fixes:
>> 1: The reference count of prange's svm_bo is decreased by an asynchronous
>> call from hmm. When waiting for prange's svm_bo to be released, we should
>> also wait for prange->svm_bo to become NULL; otherwise prange->svm_bo may
>> be set to NULL after a new VRAM buffer has been allocated.
>
> I agree with this part.
>
>
>>
>> 2: While waiting in the loop for prange's svm_bo to be released, we should
>> schedule the current task to give other tasks an opportunity to run,
>> especially the workqueue task that handles the svm_bo reference release;
>> otherwise we may hit a soft lockup.
>
> We had a similar discussion a few weeks back for another soft lockup, and
> I pointed to cond_resched(), which seems to be the preferred way to
> avoid soft lockups in the kernel. Does cond_resched() work for this case?

cond_resched() also works. I will send a new version that uses
cond_resched(), which is a safer way to reschedule here.

Regards

Xiaogang

>
> Regards,
>   Felix
>
>
>>
>> Signed-off-by: Xiaogang.Chen <Xiaogang.Chen at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 8 ++++----
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index bed0f8bf83c7..1074a4aedf57 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -502,11 +502,11 @@ svm_range_validate_svm_bo(struct kfd_node *node, struct svm_range *prange)
>>         /* We need a new svm_bo. Spin-loop to wait for concurrent
>>        * svm_range_bo_release to finish removing this range from
>> -     * its range list. After this, it is safe to reuse the
>> -     * svm_bo pointer and svm_bo_list head.
>> +     * its range list and set prange->svm_bo to null. After this,
>> +     * it is safe to reuse the svm_bo pointer and svm_bo_list head.
>>        */
>> -    while (!list_empty_careful(&prange->svm_bo_list))
>> -        ;
>> +    while (!list_empty_careful(&prange->svm_bo_list) || prange->svm_bo)
>> +        schedule();
>>         return false;
>>   }


More information about the amd-gfx mailing list