[PATCH v2] drm/amdgpu: Fix the race condition for draining retry fault
Deng, Emily
Emily.Deng at amd.com
Thu Mar 6 02:27:56 UTC 2025
[AMD Official Use Only - AMD Internal Distribution Only]
>-----Original Message-----
>From: Kuehling, Felix <Felix.Kuehling at amd.com>
>Sent: Thursday, March 6, 2025 9:25 AM
>To: amd-gfx at lists.freedesktop.org; Deng, Emily <Emily.Deng at amd.com>
>Subject: Re: [PATCH v2] drm/amdgpu: Fix the race condition for draining retry fault
>
>
>On 2025-03-05 20:10, Emily Deng wrote:
>> Issue:
>> In the scenario where svm_range_restore_pages is called, but svm-
>>checkpoint_ts
>> has not been set and the retry fault has not been drained,
>> svm_range_unmap_from_cpu is triggered and calls svm_range_free.
>> Meanwhile, svm_range_restore_pages continues execution and reaches
>> svm_range_from_addr. This results in a "failed to find prange..." error, causing the
>page recovery to fail.
>>
>> How to fix:
>> Move the timestamp check code under the protection of svm->lock.
>>
>> v2:
>> Make sure all right locks are released before go out.
>>
>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 30 ++++++++++++++++------------
>> 1 file changed, 17 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index d04725583f19..db898757f92e 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -3008,19 +3008,6 @@ svm_range_restore_pages(struct amdgpu_device
>*adev, unsigned int pasid,
>> goto out;
>> }
>>
>> - /* check if this page fault time stamp is before svms->checkpoint_ts */
>> - if (svms->checkpoint_ts[gpuidx] != 0) {
>> - if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
>> - pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
>> - r = 0;
>> - goto out;
>> - } else
>> - /* ts is after svms->checkpoint_ts now, reset svms-
>>checkpoint_ts
>> - * to zero to avoid following ts wrap around give wrong
>comparing
>> - */
>> - svms->checkpoint_ts[gpuidx] = 0;
>> - }
>> -
>> if (!p->xnack_enabled) {
>> pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
>> r = -EFAULT;
>> @@ -3040,6 +3027,23 @@ svm_range_restore_pages(struct amdgpu_device
>*adev, unsigned int pasid,
>> mmap_read_lock(mm);
>> retry_write_locked:
>> mutex_lock(&svms->lock);
>> +
>> + /* check if this page fault time stamp is before svms->checkpoint_ts */
>> + if (svms->checkpoint_ts[gpuidx] != 0) {
>> + if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
>> + pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
>> + r = 0;
>> + mutex_unlock(&svms->lock);
>> + mmap_read_unlock(mm);
>> + mmput(mm);
>> + goto out;
>
>I'd prefer goto out_unlock_svm. That way if the locking logic changes, we only need
>to update one place that releases all the right locks in the right order.
>
>If you're worried about counting the fault with svm_range_count_fault, I recommend
>setting r = -EAGAIN and only counting faults when r != -EAGAIN. That way you only
>count successful and failed faults but not retried ones.
>
>Regards,
> Felix
Thanks, modified in v3, please help review again.
Emily Deng
Best Wishes
>
>
>> + } else
>> + /* ts is after svms->checkpoint_ts now, reset svms-
>>checkpoint_ts
>> + * to zero to avoid following ts wrap around give wrong
>comparing
>> + */
>> + svms->checkpoint_ts[gpuidx] = 0;
>> + }
>> +
>> prange = svm_range_from_addr(svms, addr, NULL);
>> if (!prange) {
>> pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",
More information about the amd-gfx
mailing list