[PATCH] drm/amdkfd: Fix an eviction fence leak

Felix Kuehling felix.kuehling at amd.com
Fri Sep 27 19:48:29 UTC 2024


On 2024-09-27 06:36, Lang Yu wrote:
> dma_fence_get/put() should be called balanced in
> init_kfd_vm() and amdgpu_amdkfd_gpuvm_destroy_cb().

I don't think that's correct. The reference taken in init_kfd_vm is 
returned to the caller of amdgpu_amdkfd_gpuvm_acquire_process_vm, which 
stores it in the kfd_process structure, so it's that caller's 
responsibility to drop it. I think the real problem is that we take a 
new reference for each VM, but there is only one kfd_process structure 
per process. So the RCU_INIT_POINTER(p->ef, ef); in 
kfd_process_device_init_vm overwrites p->ef and leaks the references 
stored there earlier.
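
To make the leak concrete, here is a minimal user-space sketch, not kernel code. toy_fence, toy_process and the toy_* helpers are hypothetical stand-ins for dma_fence, kfd_process and the functions named above, and the counter only tracks the references handed to the process structure (not the one process_info holds itself).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted dma_fence. */
struct toy_fence { int refcount; };

/* Hypothetical stand-in for kfd_process: one ef slot per process. */
struct toy_process { struct toy_fence *ef; };

static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
	f->refcount++;
	return f;
}

static void toy_fence_put(struct toy_fence *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Models the current behavior: every VM init takes a new reference and
 * overwrites p->ef, so the reference stored earlier is never dropped. */
static void toy_init_vm_leaky(struct toy_process *p, struct toy_fence *fence)
{
	p->ef = toy_fence_get(fence);
}

/* Returns how many references are still outstanding after teardown
 * drops the single reference it can see through p->ef. */
static int toy_leaked_refs(int n_vms)
{
	struct toy_fence fence = { .refcount = 0 };
	struct toy_process p = { .ef = NULL };

	for (int i = 0; i < n_vms; i++)
		toy_init_vm_leaky(&p, &fence);
	if (p.ef)
		toy_fence_put(p.ef);
	return fence.refcount;
}
```

In this model a process with a single VM leaks nothing, while each additional VM leaves one more reference behind.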

Since we only need to get the eviction fence reference when creating the 
first VM, I suggest this fix in kfd_process_device_init_vm:

          ret = amdgpu_amdkfd_gpuvm_acquire_process_vm(dev->adev, avm,
                                                       &p->kgd_process_info,
-                                                     &ef);
+                                                     p->ef ? NULL : &ef);

And in init_kfd_vm:

-        *ef = dma_fence_get(&vm->process_info->eviction_fence->base);
+        if (ef)
+                *ef = dma_fence_get(&vm->process_info->eviction_fence->base);
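
As a rough illustration of why this balances the reference count, here is a user-space sketch under the same kind of assumptions: the toy_* types and functions are hypothetical stand-ins, not the driver's code. The sketch also skips the store into p->ef when no reference was requested, which is an assumption on my part and is not shown in the hunks above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for dma_fence and kfd_process. */
struct toy_fence { int refcount; };
struct toy_process { struct toy_fence *ef; };

/* Models init_kfd_vm with the NULL check: a reference is taken only
 * when the caller passes somewhere to store it. */
static void toy_init_kfd_vm(struct toy_fence *fence, struct toy_fence **ef)
{
	if (ef) {
		fence->refcount++;
		*ef = fence;
	}
}

/* Models kfd_process_device_init_vm with the suggested change: only the
 * first VM asks for a reference. Guarding the store is an assumption. */
static void toy_process_device_init_vm(struct toy_process *p,
				       struct toy_fence *fence)
{
	struct toy_fence *ef = NULL;

	toy_init_kfd_vm(fence, p->ef ? NULL : &ef);
	if (ef)
		p->ef = ef;
}

/* Returns the references still outstanding after teardown drops the
 * one reference stored in p->ef. */
static int toy_outstanding_refs(int n_vms)
{
	struct toy_fence fence = { .refcount = 0 };
	struct toy_process p = { .ef = NULL };

	for (int i = 0; i < n_vms; i++)
		toy_process_device_init_vm(&p, &fence);
	if (p.ef) {
		assert(p.ef->refcount > 0);
		p.ef->refcount--;
	}
	return fence.refcount;
}
```

With the check in place the model takes exactly one reference per process regardless of the number of VMs, and the single put at teardown releases it.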

Regards,
   Felix


>
> Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
>
> Signed-off-by: Lang Yu <lang.yu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index ce5ca304dba9..c3a4f8d297f7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1586,6 +1586,7 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>   
>   	/* Update process info */
>   	mutex_lock(&process_info->lock);
> +	dma_fence_put(&process_info->eviction_fence->base);
>   	process_info->n_vms--;
>   	list_del(&vm->vm_list_node);
>   	mutex_unlock(&process_info->lock);
> @@ -1598,7 +1599,6 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>   		WARN_ON(!list_empty(&process_info->userptr_valid_list));
>   		WARN_ON(!list_empty(&process_info->userptr_inval_list));
>   
> -		dma_fence_put(&process_info->eviction_fence->base);
>   		cancel_delayed_work_sync(&process_info->restore_userptr_work);
>   		put_pid(process_info->pid);
>   		mutex_destroy(&process_info->lock);

