[PATCH] drm/amdgpu: Handle duplicate BOs during process restore

Fri Mar 8 16:47:59 UTC 2024

On 2024-03-08 11:22, Mukul Joshi wrote:
> In certain situations, some apps can import a BO multiple times
> (through IPC for example). To restore such processes successfully,
> we need to tell drm to ignore duplicate BOs.
> While at it, also add additional logging to prevent silent failures
> when process restore fails.
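
For readers unfamiliar with the duplicate-BO problem described above, here is a minimal userspace sketch of the behavior the flag changes. It is a toy model, not kernel code: every toy_* name is hypothetical, there is no contention handling, and the only thing it mirrors is that locking an object already held by the same context is reported as -EALREADY unless the context was initialized to ignore duplicates.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins, NOT the kernel API: all toy_* names are hypothetical. */
#define TOY_EXEC_IGNORE_DUPLICATES 0x1u
#define TOY_MAX_OBJS 8

struct toy_exec;

struct toy_obj {
	const struct toy_exec *locked_by;	/* NULL while unlocked */
};

struct toy_exec {
	unsigned int flags;
	struct toy_obj *locked[TOY_MAX_OBJS];
	size_t num_locked;
};

static void toy_exec_init(struct toy_exec *exec, unsigned int flags)
{
	exec->flags = flags;
	exec->num_locked = 0;
}

/*
 * Lock obj for this context. Locking an object the context already
 * holds reports -EALREADY unless duplicates are to be ignored.
 */
static int toy_exec_lock_obj(struct toy_exec *exec, struct toy_obj *obj)
{
	if (obj->locked_by == exec) {
		if (exec->flags & TOY_EXEC_IGNORE_DUPLICATES)
			return 0;
		return -EALREADY;
	}
	if (obj->locked_by)
		return -EBUSY;	/* held elsewhere; this toy never waits */
	if (exec->num_locked == TOY_MAX_OBJS)
		return -ENOMEM;

	obj->locked_by = exec;
	exec->locked[exec->num_locked++] = obj;
	return 0;
}

/* Drop every lock taken through this context. */
static void toy_exec_fini(struct toy_exec *exec)
{
	for (size_t i = 0; i < exec->num_locked; i++)
		exec->locked[i]->locked_by = NULL;
	exec->num_locked = 0;
}

/* Lock a list that may name the same object twice, then release it. */
static int toy_lock_all(unsigned int flags, struct toy_obj **objs, size_t n)
{
	struct toy_exec exec;
	int ret = 0;

	toy_exec_init(&exec, flags);
	for (size_t i = 0; i < n && !ret; i++)
		ret = toy_exec_lock_obj(&exec, objs[i]);
	toy_exec_fini(&exec);
	return ret;
}
```

Passing a list such as { &a, &b, &a } to toy_lock_all() with flags 0 fails on the third entry, while the same list with TOY_EXEC_IGNORE_DUPLICATES succeeds. That is the difference the one-line drm_exec_init() change in this patch is after.
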
>
> Signed-off-by: Mukul Joshi <mukul.joshi at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14 ++++++++++----
>   1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index bf8e6653341f..65d808d8b5da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -2869,14 +2869,16 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
>   
>   	mutex_lock(&process_info->lock);
>   
> -	drm_exec_init(&exec, 0);
> +	drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES);
>   	drm_exec_until_all_locked(&exec) {
>   		list_for_each_entry(peer_vm, &process_info->vm_list_head,
>   				    vm_list_node) {
>   			ret = amdgpu_vm_lock_pd(peer_vm, &exec, 2);
>   			drm_exec_retry_on_contention(&exec);
> -			if (unlikely(ret))
> +			if (unlikely(ret)) {
> +				pr_err("Locking VM PD failed, ret: %d\n", ret);

pr_err makes sense here: a failure at this point indicates a persistent
problem that would cause soft hangs, as in this case.

>   				goto ttm_reserve_fail;
> +			}
>   		}
>   
>   		/* Reserve all BOs and page tables/directory. Add all BOs from
> @@ -2889,8 +2891,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
>   			gobj = &mem->bo->tbo.base;
>   			ret = drm_exec_prepare_obj(&exec, gobj, 1);
>   			drm_exec_retry_on_contention(&exec);
> -			if (unlikely(ret))
> +			if (unlikely(ret)) {
> +				pr_err("drm_exec_prepare_obj failed, ret: %d\n", ret);

Same here, pr_err is fine.

>   				goto ttm_reserve_fail;
> +			}
>   		}
>   	}
>   
> @@ -2950,8 +2954,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
>   	 * validations above would invalidate DMABuf imports again.
>   	 */
>   	ret = process_validate_vms(process_info, &exec.ticket);
> -	if (ret)
> +	if (ret) {
> +		pr_err("Validating VMs failed, ret: %d\n", ret);

I'd make this a pr_debug to avoid spamming the log. Validation can fail
intermittently, and rescheduling the worker is there to handle it.

With that fixed, the patch is

Reviewed-by: Felix Kuehling <felix.kuehling at amd.com>

>   		goto validate_map_fail;
> +	}
>   
>   	/* Update mappings not managed by KFD */
>   	list_for_each_entry(peer_vm, &process_info->vm_list_head,
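
The log-level comment on the "Validating VMs failed" message can be illustrated with a small userspace sketch. It assumes a hypothetical validation step that fails a few times before succeeding and a retry loop standing in for the rescheduled restore worker. All toy_* names and the -EAGAIN value are made up for illustration; the point is only to count how many lines each log level would emit.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical log levels; only the debug level can be switched off. */
enum toy_level { TOY_LVL_ERR, TOY_LVL_DEBUG };

struct toy_log {
	int debug_enabled;	/* debug output is off by default */
	unsigned int emitted;	/* lines that actually reached the log */
};

static void toy_log_msg(struct toy_log *log, enum toy_level lvl,
			const char *what, int ret)
{
	if (lvl == TOY_LVL_DEBUG && !log->debug_enabled)
		return;
	log->emitted++;
	fprintf(stderr, "%s failed, ret: %d\n", what, ret);
}

/* Hypothetical validation: fails on the first transient_failures attempts. */
static int toy_validate(unsigned int attempt, unsigned int transient_failures)
{
	return attempt < transient_failures ? -EAGAIN : 0;
}

/*
 * Retry until validation succeeds, logging each failure at lvl.
 * Incrementing attempt stands in for rescheduling the worker.
 * Returns the number of attempts used.
 */
static unsigned int toy_restore(struct toy_log *log, enum toy_level lvl,
				unsigned int transient_failures)
{
	unsigned int attempt = 0;
	int ret;

	while ((ret = toy_validate(attempt, transient_failures)) != 0) {
		toy_log_msg(log, lvl, "Validating VMs", ret);
		attempt++;
	}
	return attempt + 1;
}
```

With five transient failures, the error-level variant emits five lines for a restore that ultimately succeeds, while the debug-level variant emits none unless debug output is enabled.
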