[PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.
Felix Kuehling
felix.kuehling at amd.com
Mon Apr 10 19:58:04 UTC 2023
On 2023-04-10 10:36, Xiaogang.Chen wrote:
> From: Xiaogang Chen <xiaogang.chen at amd.com>
>
> During KFD restore evicted userptr BOs mmu invalidate callback may invalidate
> same userptr BOs that have been just restored. When KFD restore process detects
> it KFD will reschedule another validation process. It is not an error. Change
> WARN to pr_debug, not put the BOs at userptr_valid_list, let next scheduled
> delayed work validate them again.
The problem is not that a concurrent MMU notifier invalidated the
pages. The problem is that the sequence number and the mem->invalid flag
disagree on this. In theory, both the sequence number and the mem->invalid
flag are updated by amdgpu_amdkfd_evict_userptr in the same critical
section.
When we validate the BO, we set mem->valid to true. If mem->valid gets
set back to false later, the sequence number should also be updated so
that amdgpu_ttm_tt_get_user_pages_done should return false. So
mem->valid and the sequence number should agree on whether the memory is
valid. However, these WARNs indicate that there is a mismatch. If that
happens, it means something went wrong. Some of the code's assumptions
are being violated and this justifies a WARN.
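To illustrate the invariant described above, here is a minimal userspace
sketch in C. It is not the driver code: struct model_bo and the model_*
helpers are hypothetical stand-ins, and a pthread mutex stands in for the
critical section. It only assumes what is stated above, namely that the
sequence number and the invalid flag change together, so the two checks
can only disagree if some path updates one without the other.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a userptr BO; not the real driver struct. */
struct model_bo {
	pthread_mutex_t notifier_lock;
	unsigned long notifier_seq;	/* bumped on every invalidation */
	unsigned int invalid;		/* non-zero while marked invalid */
};

/* Eviction path: both fields change inside one critical section. */
static void model_evict(struct model_bo *bo)
{
	pthread_mutex_lock(&bo->notifier_lock);
	bo->notifier_seq++;
	bo->invalid++;
	pthread_mutex_unlock(&bo->notifier_lock);
}

/* Validation start: snapshot the sequence number and mark the BO valid. */
static unsigned long model_validate_begin(struct model_bo *bo)
{
	unsigned long seq;

	pthread_mutex_lock(&bo->notifier_lock);
	seq = bo->notifier_seq;
	bo->invalid = 0;
	pthread_mutex_unlock(&bo->notifier_lock);
	return seq;
}

/*
 * Confirmation: returns 0 if the snapshot is still current, -1 if a retry
 * is needed. *mismatch is set when the sequence check and the flag
 * disagree, which is the condition the WARNs are meant to catch.
 */
static int model_confirm(struct model_bo *bo, unsigned long seq,
			 bool *mismatch)
{
	bool valid, flagged;

	pthread_mutex_lock(&bo->notifier_lock);
	valid = (bo->notifier_seq == seq);
	flagged = (bo->invalid != 0);
	pthread_mutex_unlock(&bo->notifier_lock);

	*mismatch = (valid == flagged);
	return valid ? 0 : -1;
}

/* Walks through the three cases; returns 0 when all behave as expected. */
static int model_demo(void)
{
	struct model_bo bo = { .notifier_lock = PTHREAD_MUTEX_INITIALIZER };
	bool mismatch;
	unsigned long seq;

	/* No invalidation: valid, flag clear, no mismatch. */
	seq = model_validate_begin(&bo);
	assert(model_confirm(&bo, seq, &mismatch) == 0 && !mismatch);

	/* Concurrent invalidation: retry needed, but still no mismatch. */
	model_evict(&bo);
	assert(model_confirm(&bo, seq, &mismatch) == -1 && !mismatch);

	/* Buggy path bumps the sequence number only: mismatch detected. */
	seq = model_validate_begin(&bo);
	bo.notifier_seq++;
	assert(model_confirm(&bo, seq, &mismatch) == -1 && mismatch);
	return 0;
}
```

In this model an ordinary concurrent invalidation only produces a retry.
The mismatch case requires a path that touches one field without the
other, which is why it points at a violated assumption rather than an
expected race.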
I think you mentioned that you only see the warnings with the DKMS
driver. I suspect this is happening on some old get_user_pages code
path, not the current hmm_range_fault-based one. I would recommend
looking into this problem on the DKMS branch and addressing the problem
there. This should not be fixed by removing legitimate WARNs on the
upstream branch.
Regards,
Felix
>
> Signed-off-by: Xiaogang Chen <Xiaogang.Chen at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
> 1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 7b1f5933ebaa..d0c224703278 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -2581,11 +2581,18 @@ static int confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
>
> mem->range = NULL;
> if (!valid) {
> - WARN(!mem->invalid, "Invalid BO not marked invalid");
> + if (!mem->invalid)
> + pr_debug("Invalid BO not marked invalid\n");
> +
> + ret = -EAGAIN;
> + continue;
> + }
> +
> + if (mem->invalid) {
> + pr_debug("Valid BO is marked invalid\n");
> ret = -EAGAIN;
> continue;
> }
> - WARN(mem->invalid, "Valid BO is marked invalid");
>
> list_move_tail(&mem->validate_list.head,
> &process_info->userptr_valid_list);
> @@ -2648,7 +2655,7 @@ static void amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
> goto unlock_notifier_out;
>
> if (confirm_valid_user_pages_locked(process_info)) {
> - WARN(1, "User pages unexpectedly invalid");
> + pr_debug("User pages unexpectedly invalid, reschedule another attempt\n");
> goto unlock_notifier_out;
> }
>