[PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.

Felix Kuehling felix.kuehling at amd.com
Mon Apr 10 19:58:04 UTC 2023


On 2023-04-10 10:36, Xiaogang.Chen wrote:
> From: Xiaogang Chen <xiaogang.chen at amd.com>
>
> During KFD restore evicted userptr BOs mmu invalidate callback may invalidate
> same userptr BOs that have been just restored. When KFD restore process detects
> it KFD will reschedule another validation process. It is not an error. Change
> WARN to pr_debug, not put the BOs at userptr_valid_list, let next scheduled
> delayed work validate them again.

The problem is not, that a concurrent MMU notifier invalidated the 
pages. The problem is, that the sequence number and the mem->inval flag 
disagree on this. In theory, both the sequence number and the mem->inval 
flag are updated by amdgpu_amdkfd_evict_userptr in the same critical 
section.

When we validate the BO, we set mem->valid to true. If mem->valid gets 
set back to false later, the sequence number should also be updated so 
that amdgpu_ttm_tt_get_user_pages_done should return false. So 
mem->valid and the sequence number should agree on whether the memory is 
valid. However, these WARNs indicate that there is a mismatch. If that 
happens, it means something went wrong. Some of the code's assumptions 
are being violated and this justifies a WARN.

I think you mentioned that you only see the warnings with the DKMS 
driver. I suspect this is happening on some old get_user_pages code 
path, not the current hmm_range_fault-based one. I would recommend 
looking into this problem on the DKMS branch and addressing the problem 
there. This should not be fixed but removing legitimate WARNs on the 
upstream branch.

Regards,
   Felix


>
> Signed-off-by: Xiaogang Chen <Xiaogang.Chen at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
>   1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 7b1f5933ebaa..d0c224703278 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -2581,11 +2581,18 @@ static int confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
>   
>   		mem->range = NULL;
>   		if (!valid) {
> -			WARN(!mem->invalid, "Invalid BO not marked invalid");
> +			if (!mem->invalid)
> +				pr_debug("Invalid BO not marked invalid\n");
> +
> +			ret = -EAGAIN;
> +			continue;
> +		}
> +
> +		if (mem->invalid) {
> +			pr_debug("Valid BO is marked invalid\n");
>   			ret = -EAGAIN;
>   			continue;
>   		}
> -		WARN(mem->invalid, "Valid BO is marked invalid");
>   
>   		list_move_tail(&mem->validate_list.head,
>   			       &process_info->userptr_valid_list);
> @@ -2648,7 +2655,7 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>   		goto unlock_notifier_out;
>   
>   	if (confirm_valid_user_pages_locked(process_info)) {
> -		WARN(1, "User pages unexpectedly invalid");
> +		pr_debug("User pages unexpectedly invalid, reschedule another attempt\n");
>   		goto unlock_notifier_out;
>   	}
>   


More information about the amd-gfx mailing list