[RFC 0/4] Some (drm_sched_|dma_)fence lifetime issues

Christian König christian.koenig at amd.com
Wed May 7 14:50:27 UTC 2025


On 5/7/25 16:07, Tvrtko Ursulin wrote:
>>> Are you thinking truly never or for as long as someone has a reference?
>>
>> Truly never. It's simply a circular dependency you can never break up.
>>
>> In other words the module references the fence and the fence references the module.
> 
> Past fences being signaled? How?

See for example how amdgpu manages its VMIDs. Basically the driver keeps an array of all the fences which ever used the VMID.

When a VMID is needed the driver checks those fences, freeing the signaled ones as it goes, until an idle VMID is found.

The problem is that freeing the old signaled fences is a lazy operation and only done when a new request comes in.

As far as I know we have tons of those use cases spread all around in different drivers.
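
To make the pattern concrete, here is a toy user-space model of it. This is only a sketch and not the amdgpu code: all names (toy_fence, toy_vmid, toy_vmid_grab) and the array size are made up for illustration, and locking is left out entirely. The point is just that a signaled fence keeps its reference until the next allocation request walks the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FENCES_PER_VMID 8	/* hypothetical size, illustration only */

/* Hypothetical stand-in for a reference counted fence. */
struct toy_fence {
	int refcount;
	bool signaled;
};

/* Each ID remembers the fences of the work which used it. */
struct toy_vmid {
	struct toy_fence *fences[TOY_FENCES_PER_VMID];
};

static void toy_fence_put(struct toy_fence *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/*
 * Drop the references to all signaled fences of one ID.
 * Returns true when no unsignaled fence is left, i.e. the ID is idle.
 */
static bool toy_vmid_reap(struct toy_vmid *id)
{
	bool idle = true;

	for (size_t i = 0; i < TOY_FENCES_PER_VMID; i++) {
		struct toy_fence *f = id->fences[i];

		if (!f)
			continue;

		if (f->signaled) {
			toy_fence_put(f);
			id->fences[i] = NULL;
		} else {
			idle = false;
		}
	}
	return idle;
}

/*
 * Lazy part: signaled fences are only released here, when somebody
 * asks for an ID. Returns the index of the first idle ID or -1.
 */
static int toy_vmid_grab(struct toy_vmid *ids, size_t num_ids)
{
	for (size_t i = 0; i < num_ids; i++)
		if (toy_vmid_reap(&ids[i]))
			return (int)i;
	return -1;
}
```

If no new request ever comes in, toy_vmid_grab() is never called and the references to the already signaled fences stay around indefinitely.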

>> When the module unloads it drops the reference to the fences ultimately freeing them.
>>
>> The only issue is that modules can reference both their own as well as foreign fences. So what can happen is that you have module A which references fences A1, A2 and B1 and module B which references B1, B2 and A2.
>>
>> Now you can't unload either module first because they cross reference their fences and unloading one would leave the other module with fences which can't be released without crashing.
>>
>> So what we need to have is that the dma_fence framework guarantees that you don't need the fence->ops nor the fence->lock pointer any more after the fence signaled.
> 
> With this option it would mean guarding all entry points with the embedded lock or you had in mind something different? Going simply by the signaled bit looks impossible to be safe.

The module can't unload until all its fences are signaled, that's obvious.

I think it's safe to assume that module unload doesn't happen right after signaling a fence, so we should be safe to assume that nobody is inside the callbacks any more after some grace period.

We could use some SRCU or similar to enforce that but my gut feeling is that this would hurt more than help, especially since the code is really performance critical. 
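
Just to illustrate what "not needing fence->ops after signaling" could look like, here is a single-threaded user-space sketch. It is an assumption about one possible shape, not what the dma_fence code does today: the toy_* names are hypothetical, only the ops pointer is modeled (not the lock), and the race between signaling and a concurrent caller, which is the part the grace period would have to cover, is deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence;

/* Hypothetical callback table living in the "module". */
struct toy_fence_ops {
	const char *(*get_driver_name)(struct toy_fence *f);
};

struct toy_fence {
	const struct toy_fence_ops *ops;	/* points into module memory */
	bool signaled;
};

/* Signaling detaches the fence from the module's callback table. */
static void toy_fence_signal(struct toy_fence *f)
{
	assert(!f->signaled);
	f->signaled = true;
	f->ops = NULL;
}

/*
 * Entry point which no longer depends on the module once the fence
 * has signaled: it falls back to a generic string instead of calling
 * into the (possibly unloaded) driver.
 */
static const char *toy_fence_driver_name(struct toy_fence *f)
{
	const struct toy_fence_ops *ops = f->ops;

	if (!ops)
		return "signaled";
	return ops->get_driver_name(f);
}

/* What a driver would provide, again purely illustrative. */
static const char *toy_driver_name(struct toy_fence *f)
{
	(void)f;
	return "toy-driver";
}

static const struct toy_fence_ops toy_driver_ops = {
	.get_driver_name = toy_driver_name,
};
```

In a real implementation a caller could have loaded the ops pointer just before the signal cleared it, so something still has to guarantee that such callers are done before the module text goes away.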


>>> It is also a question how to "revoke" fences safely (race free). It sounds hard to me. It does not seem you got to this last problem in the above branches so I don't know if you had some elegant ideas for that.
>>>
>>> Maybe first to ask if anyone is aware of a precedent where something in the kernel already uses this design pattern?
>>
>> Offhand I don't know of any, but the problem sounds rather common to me.
> 
> Uf I don't know. Feels very atypical to me but I would be very glad to be told otherwise.

I vaguely remember that Greg once kicked me because I accidentally violated some rule in the FS layer which implemented something similar for signal/polling.

E.g. you can unmount and unload a file system even if userspace is still sleeping and waiting on something.

I need to look that up again, maybe it provides a good pattern on how to solve this.

Regards,
Christian.



> 
> Regards,
> 
> Tvrtko
> 


