[PATCH] drm/amdgpu: Handle duplicate BOs during process restore

Felix Kuehling felix.kuehling at amd.com
Mon Mar 11 17:27:38 UTC 2024


On 2024-03-11 12:33, Christian König wrote:
> Am 11.03.24 um 16:33 schrieb Felix Kuehling:
>> On 2024-03-11 11:25, Joshi, Mukul wrote:
>>> [AMD Official Use Only - General]
>>>
>>>> -----Original Message-----
>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>> Sent: Monday, March 11, 2024 2:50 AM
>>>> To: Joshi, Mukul <Mukul.Joshi at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process
>>>> restore
>>>>
>>>> Caution: This message originated from an External Source. Use 
>>>> proper caution
>>>> when opening attachments, clicking links, or responding.
>>>>
>>>>
>>>> Am 08.03.24 um 17:22 schrieb Mukul Joshi:
>>>>> In certain situations, some apps can import a BO multiple times
>>>>> (through IPC for example). To restore such processes successfully, we
>>>>> need to tell drm to ignore duplicate BOs.
>>>>> While at it, also add additional logging to prevent silent failures
>>>>> when process restore fails.
>>>>>
>>>>> Signed-off-by: Mukul Joshi <mukul.joshi at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14
>>>> ++++++++++----
>>>>>    1 file changed, 10 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> index bf8e6653341f..65d808d8b5da 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> @@ -2869,14 +2869,16 @@ int
>>>>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>>>>> __rcu *
>>>>>
>>>>>        mutex_lock(&process_info->lock);
>>>>>
>>>>> -     drm_exec_init(&exec, 0);
>>>>> +     drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES);
>>>>>        drm_exec_until_all_locked(&exec) {
>>>>>                list_for_each_entry(peer_vm, 
>>>>> &process_info->vm_list_head,
>>>>>                                    vm_list_node) {
>>>>>                        ret = amdgpu_vm_lock_pd(peer_vm, &exec, 2);
>>>>> drm_exec_retry_on_contention(&exec);
>>>>> -                     if (unlikely(ret))
>>>>> +                     if (unlikely(ret)) {
>>>>> +                             pr_err("Locking VM PD failed, ret:
>>>>> + %d\n", ret);
>>>>>                                goto ttm_reserve_fail;
>>>>> +                     }
>>>> That's a bad idea. Locking can always be interrupted and that would 
>>>> print an
>>>> error here.
>>>>
>>> Thanks Christian. Will send out a patch to change it to pr_debug.
>>
>> We cannot get interrupted here because we're in a worker thread. We 
>> should be running in non-interruptible mode.
>
> Ah! Ok in that case this isn't necessary.
>
> But in general I think we should avoid error printing like that. If we 
> want to know where something failed there is a function tracker for that.

In this case, it was hard to know that something failed at all. The 
problem manifested as a soft-hang in an application, and it took several 
teams several days to track it down to an eviction/restore problem in 
kernel mode. A failure to reserve BOs seems like the type of problem 
that is not expected here, and would justify an error or warning message 
in the kernel log. That would have helped track down this issue much faster.

Regards,
   Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Regards,
>>> Mukul
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>                }
>>>>>
>>>>>                /* Reserve all BOs and page tables/directory. Add all
>>>>> BOs from @@ -2889,8 +2891,10 @@ int
>>>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>>>> __rcu *
>>>>>                        gobj = &mem->bo->tbo.base;
>>>>>                        ret = drm_exec_prepare_obj(&exec, gobj, 1);
>>>>> drm_exec_retry_on_contention(&exec);
>>>>> -                     if (unlikely(ret))
>>>>> +                     if (unlikely(ret)) {
>>>>> +                             pr_err("drm_exec_prepare_obj failed,
>>>>> + ret: %d\n", ret);
>>>>>                                goto ttm_reserve_fail;
>>>>> +                     }
>>>>>                }
>>>>>        }
>>>>>
>>>>> @@ -2950,8 +2954,10 @@ int
>>>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>>>> __rcu *
>>>>>         * validations above would invalidate DMABuf imports again.
>>>>>         */
>>>>>        ret = process_validate_vms(process_info, &exec.ticket);
>>>>> -     if (ret)
>>>>> +     if (ret) {
>>>>> +             pr_err("Validating VMs failed, ret: %d\n", ret);
>>>>>                goto validate_map_fail;
>>>>> +     }
>>>>>
>>>>>        /* Update mappings not managed by KFD */
>>>>>        list_for_each_entry(peer_vm, &process_info->vm_list_head,
>


More information about the amd-gfx mailing list