[PATCH] drm/amdkfd: Change WARN to pr_debug when same userptr BOs got invalidated by mmu.

Chen, Xiaogang xiaogang.chen at amd.com
Wed Apr 12 05:42:22 UTC 2023


On 4/10/2023 2:58 PM, Felix Kuehling wrote:
> On 2023-04-10 10:36, Xiaogang.Chen wrote:
>> From: Xiaogang Chen <xiaogang.chen at amd.com>
>>
>> During KFD restore evicted userptr BOs mmu invalidate callback may 
>> invalidate
>> same userptr BOs that have been just restored. When KFD restore 
>> process detects
>> it KFD will reschedule another validation process. It is not an 
>> error. Change
>> WARN to pr_debug, not put the BOs at userptr_valid_list, let next 
>> scheduled
>> delayed work validate them again.
>
> The problem is not, that a concurrent MMU notifier invalidated the 
> pages. The problem is, that the sequence number and the mem->inval 
> flag disagree on this. In theory, both the sequence number and the 
> mem->inval flag are updated by amdgpu_amdkfd_evict_userptr in the same 
> critical section.
>
> When we validate the BO, we set mem->valid to true. If mem->valid gets 
> set back to false later, the sequence number should also be updated so 
> that amdgpu_ttm_tt_get_user_pages_done should return false. So 
> mem->valid and the sequence number should agree on whether the memory 
> is valid. However, these WARNs indicate that there is a mismatch. If 
> that happens, it means something went wrong. Some of the code's 
> assumptions are being violated and this justifies a WARN.
>
yes, mem->invalid change and mmu_interval_set_seq(mni, cur_seq) at 
amdgpu_amdkfd_evict_userptr are under process_info->notifier_lock 
protection, but mem->range->notifier_seq could be stale when mmu 
notifier happens concurrently.
> I think you mentioned that you only see the warnings with the DKMS 
> driver. I suspect this is happening on some old get_user_pages code 
> path, not the current hmm_range_fault-based one. I would recommend 
> looking into this problem on the DKMS branch and addressing the 
> problem there. This should not be fixed but removing legitimate WARNs 
> on the upstream branch.

It happened with dkms driver with HAVE_AMDKCL_HMM_MIRROR_ENABLED 
enabled, it is not old get_user_pages path. I retraced this part code. I 
think there are two issues:

1: amdgpu_ttm_tt_get_user_pages can fail(-EFAULT). If it failed we 
should not set it valid(mem->invalid = 0). In this case mem has no 
associated hmm range associated.

2: mmu notifier can happen concurrently and update 
mem->range->notifier->invalidate_seq, but not mem->range->notifier_seq. 
That causes mem->range->notifier_seq stale when mem is in 
process_info->userptr_inval_list and 
amdgpu_amdkfd_restore_userptr_worker got interrupted. At next 
rescheduled next attempt we use stale mem->range->notifier_seq to 
compare with mem->range->notifier->invalidate_seq.

I will send a patch based on amd-staging-drm-next and amd-staging-dkms-6.1.

Regards

Xiaogang

>
> Regards,
>   Felix
>
>
>>
>> Signed-off-by: Xiaogang Chen <Xiaogang.Chen at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 ++++++++++---
>>   1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 7b1f5933ebaa..d0c224703278 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2581,11 +2581,18 @@ static int 
>> confirm_valid_user_pages_locked(struct amdkfd_process_info *process_i
>>             mem->range = NULL;
>>           if (!valid) {
>> -            WARN(!mem->invalid, "Invalid BO not marked invalid");
>> +            if (!mem->invalid)
>> +                pr_debug("Invalid BO not marked invalid\n");
>> +
>> +            ret = -EAGAIN;
>> +            continue;
>> +        }
>> +
>> +        if (mem->invalid) {
>> +            pr_debug("Valid BO is marked invalid\n");
>>               ret = -EAGAIN;
>>               continue;
>>           }
>> -        WARN(mem->invalid, "Valid BO is marked invalid");
>>             list_move_tail(&mem->validate_list.head,
>>                      &process_info->userptr_valid_list);
>> @@ -2648,7 +2655,7 @@ static void 
>> amdgpu_amdkfd_restore_userptr_worker(struct work_struct *work)
>>           goto unlock_notifier_out;
>>         if (confirm_valid_user_pages_locked(process_info)) {
>> -        WARN(1, "User pages unexpectedly invalid");
>> +        pr_debug("User pages unexpectedly invalid, reschedule 
>> another attempt\n");
>>           goto unlock_notifier_out;
>>       }


More information about the amd-gfx mailing list