[RFC 3/3] drm/amdgpu: Add locking to amdgpu_ctx_mgr_entity_fini()
Tvrtko Ursulin
tvrtko.ursulin at igalia.com
Tue May 20 07:50:38 UTC 2025
On 20/05/2025 08:39, Christian König wrote:
> On 5/19/25 18:37, Tvrtko Ursulin wrote:
>> Amdgpu_ctx_mgr_entity_fini() walks the context IDR unlocked so question is
>> could it in theory see a stale entry and attempt to destroy the
>> drm_sched_entity twice?
>
> No, when this function is called when the file descriptor is freed up.
>
> So there should never ever be any concurrent user of this.
>
>> Problem is I have hit this on a KASAN enabled kernel only _once_ and never
>> since. In that case it reported amdgpu_ctx_ioctl() had freed the entity
>> already which would mean the question is could we possibly go through the
>> mutex_unlock() on one CPU, and another CPU to follow immediately with
>> file->release (DRM postclose) and see the stale entry.
>
> Mhm, it would basically mean that the file descriptor can be closed while some IOCTL on it is still ongoing.
I know, like if the VFS side of things was broken.
> I think that this would be extremely ugly and should never ever happen in the first place. Adding the mutex just band aids the issue, but not really fixes it.
So the part I wasn't sure about was not the ->release() running actually
in parallel with ->ioctl(), but straight afterwards, but on a different
CPU.
If there is any chance the missing mutex_lock() before the IDR walk
could mean a lacking memory barrier, so the entry that was just removed
by ->ioctl() is seen in ->release().
Thread A Thread B
amdgpu_ctx_ioctl
-> amdgpu_ctx_free
mutex_lock
idr_remove
mutex_unlock
fput() fput()
->release()
amdgpu_ctx_mgr_fini
-> amdgpu_ctx_mgr_entity_fini
idr_for_each_entry
stale entry ??? -> BOOM
Question is does the mutex_unlock() in Thread A ensures that the write
into the IDR array would be seen in Thread B, given no mutex_lock() in
the latter.
Regards,
Tvrtko
>> Throwing it out there not to forget about it, since I have manage to
>> lose the KASAN trace already..
>
> If you manage to trigger that again please send it to me ASAP.
>
> Thanks,
> Christian.
>
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>> Cc: Alex Deucher <alexander.deucher at amd.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> index 85567d0d9545..95b005ed839e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
>> @@ -927,6 +927,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>>
>> idp = &mgr->ctx_handles;
>>
>> + mutex_lock(&mgr->lock);
>> idr_for_each_entry(idp, ctx, id) {
>> if (kref_read(&ctx->refcount) != 1) {
>> DRM_ERROR("ctx %p is still alive\n", ctx);
>> @@ -945,6 +946,7 @@ static void amdgpu_ctx_mgr_entity_fini(struct amdgpu_ctx_mgr *mgr)
>> }
>> }
>> }
>> + mutex_unlock(&mgr->lock);
>> }
>>
>> void amdgpu_ctx_mgr_fini(struct amdgpu_ctx_mgr *mgr)
>
More information about the amd-gfx
mailing list