[RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini()

Tue May 20 16:23:51 UTC 2025

On 20/05/2025 09:40, Christian König wrote:
> On 5/20/25 09:50, Tvrtko Ursulin wrote:
>>
>> On 20/05/2025 08:39, Christian König wrote:
>>> On 5/19/25 18:37, Tvrtko Ursulin wrote:
>>>> Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is
>>>> could it in theory see a stale entry and attempt to destroy the
>>>> drm_sched_entity twice?
>>>
>>> No, when this function is called when the file descriptor is freed up.
>>>
>>> So there should never ever be any concurrent user of this.
>>>
>>>> Problem is I have hit this on a KASAN enabled kernel only _once_ and never
>>>> since. In that case it reported amdgpu_ctx_ioctl() had freed the entity
>>>> already which would mean the question is could we possibly go through the
>>>> mutex_unlock() on one CPU, and another CPU to follow immediately with
>>>> file->release (DRM postclose) and see the stale entry.
>>>
>>> Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing.
>>
>> I know, like if the VFS side of things was broken.
>>
>>> I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band aids the issue, but not really fixes it.
>>
>> So the part I wasn't sure about was not the ->release() running actually in parallel with ->ioctl(), but straight afterwards, but on a different CPU.
>>
>> If there is any chance the missing mutex_lock() before the IDR walk could mean a lacking memory barrier, so the entry that was just removed by ->ioctl() is seen in ->release().
>>
>>   Thread A        Thread B
>>   amdgpu_ctx_ioctl
>>    -> amdgpu_ctx_free
>>        mutex_lock
>>        idr_remove
>>        mutex_unlock
>>         fput()        fput()
>>              ->release()
>>              amdgpu_ctx_mgr_fini
>>               -> amdgpu_ctx_mgr_entity_fini
>>                    idr_for_each_entry
>>                  stale entry ??? -> BOOM
>>
>> Question is does the mutex_unlock() in Thread A ensures that the write into the IDR array would be seen in Thread B, given no mutex_lock() in the latter.
> 
> Oh, good question! I have absolutely no idea.
> 
> The mutex_unlock() certainly doesn't provide any memory barrier for the other CPU which does the _fini, but the fput() or the ->release code path might.
> 
> As far as I know it is rather common to not lock anything in the destroy path. I will try to double check.

FWIW I tried to provoke this with a freshly made IGT today but failed. I 
will park it for now since it does not feel a very plausible race.

Regards,

Tvrtko

>>   >> Throwing it out there not to forget about it, since I have manage to
>>>> lose the KASAN trace already..
>>>
>>> If you manage to trigger that again please send it to me ASAP.
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
>>>>    1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index 85567d0d9545..95b005ed839e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>>          idp = &mgr->ctx_handles;
>>>>    +    mutex_lock(&mgr->lock);
>>>>        idr_for_each_entry(idp, ctx, id) {
>>>>            if (kref_read(&ctx->refcount) != 1) {
>>>>                DRM_ERROR("ctx %p is still alive\n", ctx);
>>>> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>>                }
>>>>            }
>>>>        }
>>>> +    mutex_unlock(&mgr->lock);
>>>>    }
>>>>      void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr)
>>>
>>
>