[PATCH] drm/amdgpu: Fix the race condition for draining retry fault
Felix Kuehling
felix.kuehling at amd.com
Thu Mar 6 00:52:55 UTC 2025
On 2025-03-05 19:49, Emily Deng wrote:
> Issue:
> In the scenario where svm_range_restore_pages is called, but svm->checkpoint_ts
> has not been set and the retry fault has not been drained, svm_range_unmap_from_cpu
> is triggered and calls svm_range_free. Meanwhile, svm_range_restore_pages
> continues execution and reaches svm_range_from_addr. This results in
> a "failed to find prange..." error, causing the page recovery to fail.
>
> How to fix:
> Move the timestamp check code under the protection of svm->lock.
>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> ---
> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 ++++++++++++++-------------
> 1 file changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index d04725583f19..c99c10e247ad 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3008,19 +3008,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
> goto out;
> }
>
> - /* check if this page fault time stamp is before svms->checkpoint_ts */
> - if (svms->checkpoint_ts[gpuidx] != 0) {
> - if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
> - pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
> - r = 0;
> - goto out;
> - } else
> - /* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
> - * to zero to avoid following ts wrap around give wrong comparing
> - */
> - svms->checkpoint_ts[gpuidx] = 0;
> - }
> -
> if (!p->xnack_enabled) {
> pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
> r = -EFAULT;
> @@ -3040,6 +3027,20 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
> mmap_read_lock(mm);
> retry_write_locked:
> mutex_lock(&svms->lock);
> +
> + /* check if this page fault time stamp is before svms->checkpoint_ts */
> + if (svms->checkpoint_ts[gpuidx] != 0) {
> + if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) {
> + pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
> + r = 0;
> + goto out;
I believe you need to goto out_unlock_svms here to make sure all the
right locks are released.
Regards,
Felix
> + } else
> + /* ts is after svms->checkpoint_ts now, reset svms->checkpoint_ts
> + * to zero to avoid following ts wrap around give wrong comparing
> + */
> + svms->checkpoint_ts[gpuidx] = 0;
> + }
> +
> prange = svm_range_from_addr(svms, addr, NULL);
> if (!prange) {
> pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",
More information about the amd-gfx
mailing list