[PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work

Chen, Xiaogang xiaogang.chen at amd.com
Mon Feb 10 02:17:30 UTC 2025


On 2/7/2025 9:02 PM, Deng, Emily wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Ping.......
>
> Emily Deng
> Best Wishes
>
>
>
>> -----Original Message-----
>> From: Emily Deng <Emily.Deng at amd.com>
>> Sent: Friday, February 7, 2025 6:28 PM
>> To: amd-gfx at lists.freedesktop.org
>> Cc: Deng, Emily <Emily.Deng at amd.com>
>> Subject: [PATCH] drm/amdkfd: Fix the deadlock in svm_range_restore_work
>>
>> svm_range_restore_work can randomly hit a deadlock.
>> Details below:
>> 1.svm_range_restore_work
>>        ->svm_range_list_lock_and_flush_work
>>        ->mmap_write_lock
>> 2.svm_range_restore_work
>>        ->svm_range_validate_and_map
>>        ->amdgpu_vm_update_range
>>        ->amdgpu_vm_ptes_update
>>        ->amdgpu_vm_pt_alloc
>>        ->svm_range_evict_svm_bo_worker

svm_range_evict_svm_bo_worker is a function run by a kernel task from the
default system_wq. It is not the task that runs svm_range_restore_work,
which comes from system_freezable_wq. The second task may need to wait for
the first task to release mmap_write_lock, but there is no cyclic lock
dependency.

Can you explain in more detail how the deadlock happens? If a deadlock
exists between two tasks, there should be at least two locks used by both
tasks.

Regards

Xiaogang

>>        ->mmap_read_lock (deadlock here, because mmap_write_lock is already held)
>>
>> How to fix?
>> Downgrade the write lock to a read lock.
>>
>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index bd3e20d981e0..c907e2de3dde 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -1841,6 +1841,7 @@ static void svm_range_restore_work(struct work_struct *work)
>>        mutex_lock(&process_info->lock);
>>        svm_range_list_lock_and_flush_work(svms, mm);
>>        mutex_lock(&svms->lock);
>> +      mmap_write_downgrade(mm);
>>
>>        evicted_ranges = atomic_read(&svms->evicted_ranges);
>>
>> @@ -1890,7 +1891,7 @@ static void svm_range_restore_work(struct work_struct *work)
>>
>> out_reschedule:
>>        mutex_unlock(&svms->lock);
>> -      mmap_write_unlock(mm);
>> +      mmap_read_unlock(mm);
>>        mutex_unlock(&process_info->lock);
>>
>>        /* If validation failed, reschedule another attempt */
>> --
>> 2.34.1