[RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini()
Christian König
christian.koenig at amd.com
Tue May 20 08:40:13 UTC 2025
On 5/20/25 09:50, Tvrtko Ursulin wrote:
>
> On 20/05/2025 08:39, Christian König wrote:
>> On 5/19/25 18:37, Tvrtko Ursulin wrote:
>>> amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked, so the question
>>> is: could it in theory see a stale entry and attempt to destroy the
>>> drm_sched_entity twice?
>>
>> No, this function is called when the file descriptor is freed up.
>>
>> So there should never ever be any concurrent user of this.
>>
>>> The problem is I have hit this on a KASAN enabled kernel only _once_ and never
>>> since. In that case it reported that amdgpu_ctx_ioctl() had already freed the
>>> entity, which raises the question: could we go through the mutex_unlock() on
>>> one CPU, have another CPU follow immediately with file->release (DRM
>>> postclose), and still see the stale entry?
>>
>> Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing.
>
> I know, like if the VFS side of things was broken.
>
>> I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band-aids the issue, but doesn't really fix it.
>
> So the part I wasn't sure about was not ->release() actually running in parallel with ->ioctl(), but straight afterwards on a different CPU.
>
> The question is whether the missing mutex_lock() before the IDR walk could mean a missing memory barrier, so that the entry just removed by ->ioctl() is still seen by ->release().
>
> Thread A                         Thread B
> amdgpu_ctx_ioctl
>  -> amdgpu_ctx_free
>      mutex_lock
>      idr_remove
>      mutex_unlock
> fput()                           fput()
>                                  ->release()
>                                   amdgpu_ctx_mgr_fini
>                                    -> amdgpu_ctx_mgr_entity_fini
>                                        idr_for_each_entry
>                                         stale entry ??? -> BOOM
>
> Question is: does the mutex_unlock() in Thread A ensure that the write into the IDR array is seen by Thread B, given there is no mutex_lock() in the latter?
Oh, good question! I have absolutely no idea.
The mutex_unlock() certainly doesn't provide any memory barrier for the other CPU which does the _fini, but the fput() or the ->release() code path might.
As far as I know it is rather common not to lock anything in the destroy path. I will try to double check.
Thanks,
Christian.
>
> Regards,
>
> Tvrtko
>>> Throwing it out there not to forget about it, since I have managed to
>>> lose the KASAN trace already..
>>
>> If you manage to trigger that again please send it to me ASAP.
>>
>> Thanks,
>> Christian.
>>
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>> Cc: Alex Deucher <alexander.deucher at amd.com>
>>> Cc: Christian König <christian.koenig at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> index 85567d0d9545..95b005ed839e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>>> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>  	idp = &mgr->ctx_handles;
>>>  
>>> +	mutex_lock(&mgr->lock);
>>>  	idr_for_each_entry(idp, ctx, id) {
>>>  		if (kref_read(&ctx->refcount) != 1) {
>>>  			DRM_ERROR("ctx %p is still alive\n", ctx);
>>> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>>  			}
>>>  		}
>>>  	}
>>> +	mutex_unlock(&mgr->lock);
>>>  }
>>> void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr)
>>
>
More information about the amd-gfx mailing list