[PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

Yadav, Arvind arvyadav at amd.com
Fri Feb 14 13:02:02 UTC 2025


On 2/14/2025 6:09 PM, Christian König wrote:
> Yeah, completely agree.
>
> But not checking the syncobj handle before doing the update is actually even more problematic than leaking the memory.
>
> This could be used by userspace to put the kernel into a broken situation it can't come out any more.
>
> Arvin can you take care of the complete fix?
Sure, I will do that.


Thanks,

~Arvind
>
> Thanks,
> Christian.
>
> Am 14.02.25 um 13:14 schrieb YuanShang Mao (River):
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Better to put the fence outside amdgpu_gem_va_update_vm. Since it is passed to the caller, and the caller must keep one reference at least until this fence is no longer needed.
>>
>> Thanks
>> River
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Yadav, Arvind
>> Sent: Friday, February 14, 2025 7:42 PM
>> To: Koenig, Christian <Christian.Koenig at amd.com>; Ma, Le <Le.Ma at amd.com>; amd-gfx at lists.freedesktop.org; Yadav, Arvind <Arvind.Yadav at amd.com>
>> Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>
>> Subject: Re: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released
>>
>>
>> On 2/14/2025 4:08 PM, Christian König wrote:
>>> Adding Arvind, please make sure to keep him in the loop.
>>>
>>> Am 14.02.25 um 11:07 schrieb Le Ma:
>>>> On systems with CONFIG_SLUB_DEBUG enabled, the memleak like below
>>>> will show up explicitly during driver unloading if created bo without
>>>> drm_timeline object before.
>>>>
>>>>       BUG drm_sched_fence (Tainted: G           OE     ): Objects remaining in drm_sched_fence on __kmem_cache_shutdown()
>>>>       -----------------------------------------------------------------------------
>>>>       Call Trace:
>>>>       <TASK>
>>>>       dump_stack_lvl+0x4c/0x70
>>>>       dump_stack+0x14/0x20
>>>>       slab_err+0xb0/0xf0
>>>>       ? srso_alias_return_thunk+0x5/0xfbef5
>>>>       ? flush_work+0x12/0x20
>>>>       ? srso_alias_return_thunk+0x5/0xfbef5
>>>>       __kmem_cache_shutdown+0x163/0x2e0
>>>>       kmem_cache_destroy+0x61/0x170
>>>>       drm_sched_fence_slab_fini+0x19/0x900
>>>>
>>>> Thus call dma_fence_put properly to avoid the memleak.
>>>>
>>>> v2: call dma_fence_put in amdgpu_gem_va_update_vm
>>>>
>>>> Signed-off-by: Le Ma <le.ma at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 9 +++++++--
>>>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> index 8b67aae6c2fe..00f1f34705c0 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> @@ -759,7 +759,8 @@ static struct dma_fence *
>>>>    amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>>>>                       struct amdgpu_vm *vm,
>>>>                       struct amdgpu_bo_va *bo_va,
>>>> -                    uint32_t operation)
>>>> +                    uint32_t operation,
>>>> +                    uint32_t syncobj_handle)
>>>>    {
>>>>       struct dma_fence *fence = dma_fence_get_stub();
>>>>       int r;
>>>> @@ -771,6 +772,9 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>>>>       if (r)
>>>>               goto error;
>>>>
>>>> +    if (!syncobj_handle)
>>>> +            dma_fence_put(fence);
>>>> +
>>> Having that check inside amdgpu_gem_update_bo_mapping() was actually correct. Here it doesn't make much sense.
>> Agreed,
>>
>> Regards,
>> ~Arvind
>>
>>>>       if (operation == AMDGPU_VA_OP_MAP ||
>>>>           operation == AMDGPU_VA_OP_REPLACE) {
>>>>               r = amdgpu_vm_bo_update(adev, bo_va, false); @@ -965,7 +969,8 @@
>>>> int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>                                                   &timeline_chain);
>>> Right before this here is a call to amdgpu_gem_update_timeline_node() which is incorrectly placed.
>>>
>>> That needs to come much earlier, above the switch (args->operation)....
>>>
>>> Regards,
>>> Christian.
>>>
>>>>               fence = amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>> -                                            args->operation);
>>>> +                                            args->operation,
>>>> +                                            args->vm_timeline_syncobj_out);
>>>>
>>>>               if (!r)
>>>>                       amdgpu_gem_update_bo_mapping(filp, bo_va,


More information about the amd-gfx mailing list