[RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini()
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Tue May 20 16:23:51 UTC 2025
On 20/05/2025 09:40, Christian König wrote:
> On 5/20/25 09:50, Tvrtko Ursulin wrote:
>>
>> On 20/05/2025 08:39, Christian König wrote:
>>> On 5/19/25 18:37, Tvrtko Ursulin wrote:
>>>> Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is
>>>> could it in theory see a stale entry and attempt to destroy the
>>>> drm_sched_entity twice?
>>>
>>> No, when this function is called when the file descriptor is freed up.
>>>
>>> So there should never ever be any concurrent user of this.
>>>
>>>> Problem is I have hit this on a KASAN enabled kernel only _once_ and never
>>>> since. In that case it reported amdgpu_ctx_ioctl() had freed the entity
>>>> already which would mean the question is could we possibly go through the
>>>> mutex_unlock() on one CPU, and another CPU to follow immediately with
>>>> file->release (DRM postclose) and see the stale entry.
>>>
>>> Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing.
>>
>> I know, like if the VFS side of things was broken.
>>
>>> I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band aids the issue, but not really fixes it.
>>
>> So the part I wasn't sure about was not the ->release() running actually in parallel with ->ioctl(), but straight afterwards, but on a different CPU.
>>
>> If there is any chance the missing mutex_lock() before the IDR walk could mean a lacking memory barrier, so the entry that was just removed by ->ioctl() is seen in ->release().
>>
>> Thread A Thread B
>> amdgpu_ctx_ioctl
>> -> amdgpu_ctx_free
>> mutex_lock
>> idr_remove
>> mutex_unlock
>> fput() fput()
>> ->release()
>> amdgpu_ctx_mgr_fini
>> -> amdgpu_ctx_mgr_entity_fini
>> idr_for_each_entry
>> stale entry ??? -> BOOM
>>
>> Question is does the mutex_unlock() in Thread A ensures that the write into the IDR array would be seen in Thread B, given no mutex_lock() in the latter.
>
> Oh, good question! I have absolutely no idea.
>
> The mutex_unlock() certainly doesn't provide any memory barrier for the other CPU which does the _fini, but the fput() or the ->release code path might.
>
> As far as I know it is rather common to not lock anything in the destroy path. I will try to double check.
FWIW I tried to provoke this with a freshly made IGT today but failed. I
will park it for now since it does not feel a very plausible race.
Regards,
Tvrtko
>> >> Throwing it out there not to forget about it, since I have manage to
>>>> lose the KASAN trace already..
>>>
>>> If you manage to trigger that again please send it to me ASAP.
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>>> Cc: Christian König <christian.koenig at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> index 85567d0d9545..95b005ed839e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>>> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>> idp = &mgr->ctx_handles;
>>>> + mutex_lock(&mgr->lock);
>>>> idr_for_each_entry(idp, ctx, id) {
>>>> if (kref_read(&ctx->refcount) != 1) {
>>>> DRM_ERROR("ctx %p is still alive\n", ctx);
>>>> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>> }
>>>> }
>>>> }
>>>> + mutex_unlock(&mgr->lock);
>>>> }
>>>> void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr)
>>>
>>
>
More information about the amd-gfx
mailing list