[PATCH v3] drm/amdgpu: fix the memleak caused by fence not released
Yadav, Arvind
arvyadav at amd.com
Thu Feb 27 15:08:46 UTC 2025
On 2/27/2025 7:55 PM, Christian König wrote:
>
> Am 18.02.25 um 15:53 schrieb Arvind Yadav:
>> Encountering a taint issue during the unloading of gpu_sched
>> due to the fence not being released/put. In this context,
>> amdgpu_vm_clear_freed is responsible for creating a job to
>> update the page table (PT). It allocates kmem_cache for
>> drm_sched_fence and returns the finished fence associated
>> with job->base.s_fence. In case of Usermode queue this finished
>> fence is added to the timeline sync object through
>> amdgpu_gem_update_bo_mapping, which is utilized by user
>> space to ensure the completion of the PT update.
>>
>> [ 508.900587] =============================================================================
>> [ 508.900605] BUG drm_sched_fence (Tainted: G N): Objects remaining in drm_sched_fence on __kmem_cache_shutdown()
>> [ 508.900617] -----------------------------------------------------------------------------
>>
>> [ 508.900627] Slab 0xffffe0cc04548780 objects=32 used=2 fp=0xffff8ea81521f000 flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff)
>> [ 508.900645] CPU: 3 UID: 0 PID: 2337 Comm: rmmod Tainted: G N 6.12.0+ #1
>> [ 508.900651] Tainted: [N]=TEST
>> [ 508.900653] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F34 06/10/2021
>> [ 508.900656] Call Trace:
>> [ 508.900659] <TASK>
>> [ 508.900665] dump_stack_lvl+0x70/0x90
>> [ 508.900674] dump_stack+0x14/0x20
>> [ 508.900678] slab_err+0xcb/0x110
>> [ 508.900687] ? srso_return_thunk+0x5/0x5f
>> [ 508.900692] ? try_to_grab_pending+0xd3/0x1d0
>> [ 508.900697] ? srso_return_thunk+0x5/0x5f
>> [ 508.900701] ? mutex_lock+0x17/0x50
>> [ 508.900708] __kmem_cache_shutdown+0x144/0x2d0
>> [ 508.900713] ? flush_rcu_work+0x50/0x60
>> [ 508.900719] kmem_cache_destroy+0x46/0x1f0
>> [ 508.900728] drm_sched_fence_slab_fini+0x19/0x970 [gpu_sched]
>> [ 508.900736] __do_sys_delete_module.constprop.0+0x184/0x320
>> [ 508.900744] ? srso_return_thunk+0x5/0x5f
>> [ 508.900747] ? debug_smp_processor_id+0x1b/0x30
>> [ 508.900754] __x64_sys_delete_module+0x16/0x20
>> [ 508.900758] x64_sys_call+0xdf/0x20d0
>> [ 508.900763] do_syscall_64+0x51/0x120
>> [ 508.900769] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> v2: call dma_fence_put in amdgpu_gem_va_update_vm
>> v3: Addressed review comments from Christian.
>> - calling amdgpu_gem_update_timeline_node before switch.
>> - puting a dma_fence in case of error or !timeline_syncobj.
>>
>> Cc: Alex Deucher <alexander.deucher at amd.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: Shashank Sharma <shashank.sharma at amd.com>
>> Cc: Sunil Khatri <sunil.khatri at amd.com>
>> Signed-off-by: Le Ma <le.ma at amd.com>
>> Signed-off-by: Arvind Yadav <arvind.yadav at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 33 +++++++++++++------------
>> 1 file changed, 17 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index 8b67aae6c2fe..40522b4f331b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -125,9 +125,6 @@ amdgpu_gem_update_bo_mapping(struct drm_file *filp,
>> struct amdgpu_vm *vm = &fpriv->vm;
>> struct dma_fence *last_update;
>>
>> - if (!syncobj)
>> - return;
>> -
>> /* Find the last update fence */
>> switch (operation) {
>> case AMDGPU_VA_OP_MAP:
>> @@ -839,6 +836,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>> struct drm_exec exec;
>> uint64_t va_flags;
>> uint64_t vm_size;
>> + int syncobj_status;
>> int r = 0;
>>
>> if (args->va_address < AMDGPU_VA_RESERVED_BOTTOM) {
>> @@ -931,6 +929,12 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>> bo_va = NULL;
>> }
>>
>> + syncobj_status = amdgpu_gem_update_timeline_node(filp,
>> + args->vm_timeline_syncobj_out,
>> + args->vm_timeline_point,
>> + &timeline_syncobj,
>> + &timeline_chain);
>> +
> You don't need syncobj_status here, just assign the return value to r and abort if we can't find any syncobj.
I have not used variable 'r' because below inside
switch(args->operation) the 'r' value will be reinitialized and the
return value of amdgpu_gem_update_timeline_node will not be used. Here,
we cannot abort because syncobj will be NULL for Non-UQ.
we can use r but it will not do anything. :)
>
>> switch (args->operation) {
>> case AMDGPU_VA_OP_MAP:
>> va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
>> @@ -957,22 +961,19 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>> break;
>> }
>> if (!r && !(args->flags & AMDGPU_VM_DELAY_UPDATE) && !adev->debug_vm) {
>> -
>> - r = amdgpu_gem_update_timeline_node(filp,
>> - args->vm_timeline_syncobj_out,
>> - args->vm_timeline_point,
>> - &timeline_syncobj,
>> - &timeline_chain);
>> -
>> fence = amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>> args->operation);
>>
>> - if (!r)
>> - amdgpu_gem_update_bo_mapping(filp, bo_va,
>> - args->operation,
>> - args->vm_timeline_point,
>> - fence, timeline_syncobj,
>> - timeline_chain);
>> + if (syncobj_status || !timeline_syncobj) {
>> + dma_fence_put(fence);
>> + goto error;
>> + }
> That should probably be something like this:
>
> if (timeline_syncobj)
> amdgpu_gem_update_bo_mapping(..);
> else
> dma_fence_put(fence);
Noted....
Thanks,
Arvind
>
> Regards,
> Christian.
>
>> +
>> + amdgpu_gem_update_bo_mapping(filp, bo_va,
>> + args->operation,
>> + args->vm_timeline_point,
>> + fence, timeline_syncobj,
>> + timeline_chain);
>> }
>>
>> error:
More information about the amd-gfx
mailing list