[RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini()
Christian König
christian.koenig at amd.com
Tue May 20 08:40:13 UTC 2025
On 5/20/25 09:50, Tvrtko Ursulin wrote:
>
> On 20/05/2025 08:39, Christian König wrote:
>> On 5/19/25 18:37, Tvrtko Ursulin wrote:
>>> amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked, so the question
>>> is: could it in theory see a stale entry and attempt to destroy the
>>> drm_sched_entity twice?
>>
>> No, this function is called when the file descriptor is freed up.
>>
>> So there should never ever be any concurrent user of this.
>>
>>> The problem is I have hit this on a KASAN enabled kernel only _once_ and never
>>> since. In that case it reported that amdgpu_ctx_ioctl() had already freed the
>>> entity, which raises the question: could we go through the mutex_unlock() on
>>> one CPU, have another CPU follow immediately with file->release (DRM
>>> postclose), and still see the stale entry?
>>
>> Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing.
>
> I know, like if the VFS side of things was broken.
>
>> I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band-aids the issue, but doesn't really fix it.
>
> So the part I wasn't sure about was not ->release() actually running in parallel with ->ioctl(), but straight afterwards on a different CPU.
>
> The question is whether the missing mutex_lock() before the IDR walk could mean a missing memory barrier, so that the entry just removed by ->ioctl() is still seen by ->release().
>
> Thread A                         Thread B
> amdgpu_ctx_ioctl
>  -> amdgpu_ctx_free
>      mutex_lock
>      idr_remove
>      mutex_unlock
> fput()                           fput()
>                                  ->release()
>                                   amdgpu_ctx_mgr_fini
>                                    -> amdgpu_ctx_mgr_entity_fini
>                                        idr_for_each_entry
>                                         stale entry ??? -> BOOM
>
> Question is: does the mutex_unlock() in Thread A ensure that the write into the IDR array is seen by Thread B, given there is no mutex_lock() in the latter?
Oh, good question! I have absolutely no idea.
The mutex_unlock() certainly doesn't provide any memory barrier for the other CPU which does the _fini, but the fput() or the ->release() code path might.
As far as I know it is rather common not to lock anything in the destroy path. I will try to double check.
Thanks,
Christian.
>
> Regards,
>
> Tvrtko
>>> Throwing it out there not to forget about it, since I have managed to
>>> lose the KASAN trace already..
>>
>> If you manage to trigger that again please send it to me ASAP.
>>
>> Thanks,
>> Christian.
>>
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index 85567d0d9545..95b005ed839e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>  	idp = &mgr->ctx_handles;
>>>  
>>> +	mutex_lock(&mgr->lock);
>>>  	idr_for_each_entry(idp, ctx, id) {
>>>  		if (kref_read(&ctx->refcount) != 1) {
>>>  			DRM_ERROR("ctx %p is still alive\n", ctx);
>>> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>  			}
>>>  		}
>>>  	}
>>> +	mutex_unlock(&mgr->lock);
>>>  }
>>> void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr)
>>
>
More information about the amd-gfx mailing list