[PATCH] drm/amdgpu: Fix eviction fence worker race during fd close
Yadav, Arvind
arvyadav at amd.com
Thu May 15 14:08:31 UTC 2025
Reviewed-by: Arvind Yadav <Arvind.Yadav at amd.com>
On 5/15/2025 6:49 PM, Liang, Prike wrote:
> [Public]
>
> I hadn't cleaned up the userq resource destroy in the postclose callback in my last patch, so please remove the duplicated userq destroy here. With that, the change in this patch is
> Reviewed-by: Prike Liang <Prike.Liang at amd.com>
>
> Regards,
> Prike
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>> Jesse.Zhang
>> Sent: Thursday, May 15, 2025 3:07 PM
>> To: amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Zhang, Jesse(Jie) <Jesse.Zhang at amd.com>
>> Subject: [PATCH] drm/amdgpu: Fix eviction fence worker race during fd close
>>
>> The current cleanup order during file descriptor close can lead to a race condition
>> where the eviction fence worker attempts to access a destroyed mutex from the
>> user queue manager:
>>
>> [  517.294055] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
>> [  517.294060] WARNING: CPU: 8 PID: 2030 at kernel/locking/mutex.c:564
>> [  517.294094] Workqueue: events amdgpu_eviction_fence_suspend_worker [amdgpu]
>>
>> The issue occurs because:
>> 1. We destroy the user queue manager (including its mutex) first
>> 2. Then we try to destroy eviction fences which may have pending work
>> 3. The eviction fence worker may try to access the already-destroyed mutex
>>
>> Fix this by reordering the cleanup to:
>> 1. First mark the fd as closing and destroy eviction fences,
>> which flushes any pending work
>> 2. Then safely destroy the user queue manager after we're certain
>> no more fence work will be executed
>>
>> Signed-off-by: Jesse Zhang <Jesse.Zhang at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 2 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 4ddd08ce8885..4db92e0a60da 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -2913,8 +2913,8 @@ static int amdgpu_drm_release(struct inode *inode, struct file *filp)
>>
>> if (fpriv) {
>> fpriv->evf_mgr.fd_closing = true;
>> - amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
>> amdgpu_eviction_fence_destroy(&fpriv->evf_mgr);
>> + amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
>> }
>>
>> return drm_release(inode, filp);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 9fbb04aee97b..1fec3713fbf2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -1504,8 +1504,8 @@ void amdgpu_driver_postclose_kms(struct drm_device *dev,
>>
>> if (!fpriv->evf_mgr.fd_closing) {
>> fpriv->evf_mgr.fd_closing = true;
>> - amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
>> amdgpu_eviction_fence_destroy(&fpriv->evf_mgr);
>> + amdgpu_userq_mgr_fini(&fpriv->userq_mgr);
>> }
>> amdgpu_ctx_mgr_fini(&fpriv->ctx_mgr);
>> amdgpu_vm_fini(adev, &fpriv->vm);
>> --
>> 2.49.0