[PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.
Chen, Xiaogang
xiaogang.chen at amd.com
Wed Apr 12 05:42:22 UTC 2023
On 4/10/2023 2:58 PM, Felix Kuehling wrote:
> On 2023-04-10 10:36, Xiaogang.Chen wrote:
>> From: Xiaogang Chen <xiaogang.chen at amd.com>
>>
>> During KFD restore evicted userptr BOs mmu invalidate callback may
>> invalidate
>> same userptr BOs that have been just restored. When KFD restore
>> process detects
>> it KFD will reschedule another validation process. It is not an
>> error. Change
>> WARN to pr_debug, not put the BOs at userptr_valid_list, let next
>> scheduled
>> delayed work validate them again.
>
> The problem is not, that a concurrent MMU notifier invalidated the
> pages. The problem is, that the sequence number and the mem->inval
> flag disagree on this. In theory, both the sequence number and the
> mem->inval flag are updated by amdgpu_amdkfd_evict_userptr in the same
> critical section.
>
> When we validate the BO, we set mem->valid to true. If mem->valid gets
> set back to false later, the sequence number should also be updated so
> that amdgpu_ttm_tt_get_user_pages_done should return false. So
> mem->valid and the sequence number should agree on whether the memory
> is valid. However, these WARNs indicate that there is a mismatch. If
> that happens, it means something went wrong. Some of the code's
> assumptions are being violated and this justifies a WARN.
>
yes, mem->invalid change and mmu_interval_set_seq(mni, cur_seq) at
amdgpu_amdkfd_evict_userptr are under process_info->notifier_lock
protection, but mem->range->notifier_seq could be stale when mmu
notifier happens concurrently.
> I think you mentioned that you only see the warnings with the DKMS
> driver. I suspect this is happening on some old get_user_pages code
> path, not the current hmm_range_fault-based one. I would recommend
> looking into this problem on the DKMS branch and addressing the
> problem there. This should not be fixed but removing legitimate WARNs
> on the upstream branch.
It happened with dkms driver with HAVE_AMDKCL_HMM_MIRROR_ENABLED
enabled, it is not old get_user_pages path. I retraced this part code. I
think there are two issues:
1: amdgpu_ttm_tt_get_user_pages can fail(-EFAULT). If it failed we
should not set it valid(mem->invalid = 0). In this case mem has no
associated hmm range associated.
2: mmu notifier can happen concurrently and update
mem->range->notifier->invalidate_seq, but not mem->range->notifier_seq.
That causes mem->range->notifier_seq stale when mem is in
process_info->userptr_inval_list and
amdgpu_amdkfd_restore_userptr_worker got interrupted. At next
rescheduled next attempt we use stale mem->range->notifier_seq to
compare with mem->range->notifier->invalidate_seq.
I will send a patch based on amd-staging-drm-next and amd-staging-dkms-6.1.
Regards
Xiaogang
>
> Regards,
> Felix
>
>
>>
>> Signed-off-by: Xiaogang Chen <Xiaogang.Chen at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
>> 1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 7b1f5933ebaa..d0c224703278 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2581,11 +2581,18 @@ static int
>> confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
>> mem->range = NULL;
>> if (!valid) {
>> - WARN(!mem->invalid, "Invalid BO not marked invalid");
>> + if (!mem->invalid)
>> + pr_debug("Invalid BO not marked invalid\n");
>> +
>> + ret = -EAGAIN;
>> + continue;
>> + }
>> +
>> + if (mem->invalid) {
>> + pr_debug("Valid BO is marked invalid\n");
>> ret = -EAGAIN;
>> continue;
>> }
>> - WARN(mem->invalid, "Valid BO is marked invalid");
>> list_move_tail(&mem->validate_list.head,
>> &process_info->userptr_valid_list);
>> @@ -2648,7 +2655,7 @@ static void
>> amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>> goto unlock_notifier_out;
>> if (confirm_valid_user_pages_locked(process_info)) {
>> - WARN(1, "User pages unexpectedly invalid");
>> + pr_debug("User pages unexpectedly invalid, reschedule
>> another attempt\n");
>> goto unlock_notifier_out;
>> }
More information about the amd-gfx
mailing list