[PATCH v2] drm/amdgpu: Fix missing drain retry fault the last entry

Thu Mar 6 00:50:36 UTC 2025

On 2025-03-04 22:54, Emily Deng wrote:
> When the entry obtained in svm_range_unmap_from_cpu is the last entry and
> that entry is a page fault, it also needs to be dropped. So in the equal
> case, the fault must be dropped as well.
>
> v2:
> Only modify svm_range_restore_pages.
>
> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h | 3 +++
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c   | 2 +-
>   2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> index 7d4395a5d8ac..b0a88f92cd82 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
> @@ -78,6 +78,9 @@ struct amdgpu_ih_ring {
>   #define amdgpu_ih_ts_after(t1, t2) \
>   		(((int64_t)((t2) << 16) - (int64_t)((t1) << 16)) > 0LL)
>   
> +#define amdgpu_ih_ts_after_or_equal(t1, t2) \
> +		(((int64_t)((t2) << 16) - (int64_t)((t1) << 16)) >= 0LL)
> +
>   /* provided by the ih block */
>   struct amdgpu_ih_funcs {
>   	/* ring read/write ptr handling, called from interrupt context */
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index bd3e20d981e0..d04725583f19 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3010,7 +3010,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
>   
>   	/* check if this page fault time stamp is before svms->checkpoint_ts */
>   	if (svms->checkpoint_ts[gpuidx] != 0) {
> -		if (amdgpu_ih_ts_after(ts,  svms->checkpoint_ts[gpuidx])) {
> +		if (amdgpu_ih_ts_after_or_equal(ts,  svms->checkpoint_ts[gpuidx])) {
>   			pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
>   			r = 0;

Looks reasonable. Should we also set r = -EAGAIN, so that the next fault 
for this address doesn't get rejected by the SW filter? See the code at 
the end of svm_range_restore_pages that handles -EAGAIN and calls 
amdgpu_gmc_filter_faults_remove. I guess that only matters on chips 
where the CAM is not enabled.
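
To make the concern concrete, here is a rough user-space toy model of the
SW filter interaction. It is only a sketch under assumptions: struct
toy_filter, toy_filter_faults, toy_filter_remove and toy_handle_fault are
hypothetical stand-ins, not the amdgpu_gmc code, and the only behavior taken
from the driver is that the exit path of svm_range_restore_pages removes the
filter entry when r == -EAGAIN.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FILTER_SLOTS 8

/* Hypothetical stand-in for the software fault filter. */
struct toy_filter {
	uint64_t addr[TOY_FILTER_SLOTS];
	bool used[TOY_FILTER_SLOTS];
};

/* Return true if addr is already filtered; otherwise record it and return false. */
static bool toy_filter_faults(struct toy_filter *f, uint64_t addr)
{
	int i, free_slot = 0;

	for (i = 0; i < TOY_FILTER_SLOTS; i++) {
		if (f->used[i] && f->addr[i] == addr)
			return true;
		if (!f->used[i])
			free_slot = i;
	}
	/* If every slot is used, slot 0 is overwritten (toy behavior only). */
	f->addr[free_slot] = addr;
	f->used[free_slot] = true;
	return false;
}

/* Drop the filter entry for addr, if there is one. */
static void toy_filter_remove(struct toy_filter *f, uint64_t addr)
{
	int i;

	for (i = 0; i < TOY_FILTER_SLOTS; i++)
		if (f->used[i] && f->addr[i] == addr)
			f->used[i] = false;
}

/*
 * Model one fault. Returns true if the fault reached the handler, false if
 * the filter rejected it. "draining" models the checkpoint_ts drop path and
 * "use_eagain" selects r = -EAGAIN instead of r = 0 on that path.
 */
static bool toy_handle_fault(struct toy_filter *f, uint64_t addr,
			     bool draining, bool use_eagain)
{
	int r = 0;

	if (toy_filter_faults(f, addr))
		return false;

	if (draining)
		r = use_eagain ? -EAGAIN : 0;

	/* Mirrors the exit path: -EAGAIN clears the filter entry. */
	if (r == -EAGAIN)
		toy_filter_remove(f, addr);

	return true;
}
```

In this model, dropping with r = 0 leaves the entry in the filter, so the
next fault for the same address is rejected. Dropping with r = -EAGAIN
clears the entry and the next fault gets handled.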

Regards,
   Felix

>   			goto out;
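
For reference, the difference between the two comparisons can be checked in
user space with a small sketch. The helpers below are hypothetical stand-ins
(ih_ts_after, ih_ts_after_or_equal and should_drop_fault are not driver
functions). They follow the macros in the patch, except that the subtraction
is done on unsigned values before the cast so that the wrap-around case does
not rely on signed overflow outside the kernel build flags.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if t2 is strictly after t1. Shifting left by 16 discards the upper
 * bits, so only the low 48 bits of each timestamp take part and the
 * comparison survives a 48-bit wrap-around. The cast of a large unsigned
 * value to int64_t is implementation-defined; this assumes the usual
 * two's-complement behavior.
 */
static inline bool ih_ts_after(uint64_t t1, uint64_t t2)
{
	return (int64_t)((t2 << 16) - (t1 << 16)) > 0;
}

/* Same as above, but also true when the two timestamps are equal. */
static inline bool ih_ts_after_or_equal(uint64_t t1, uint64_t t2)
{
	return (int64_t)((t2 << 16) - (t1 << 16)) >= 0;
}

/*
 * Hypothetical helper mirroring the patched check: drop the fault when a
 * checkpoint is set and the fault timestamp is not newer than it.
 */
static inline bool should_drop_fault(uint64_t ts, uint64_t checkpoint_ts)
{
	return checkpoint_ts != 0 && ih_ts_after_or_equal(ts, checkpoint_ts);
}
```

With the strict comparison, a fault whose timestamp equals the checkpoint is
not dropped. With the or-equal variant it is, which is the last-entry case
the commit message describes.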