Switching over to GEM refcounts and a bunch of cleanups

Fri Aug 22 08:51:13 UTC 2025

Hi

On Thu, 2025-08-21 at 16:59 +0200, Christian König wrote:
> On 21.08.25 16:06, Thomas Hellström wrote:
> > > What are you referring to?
> > 
> > https://lore.kernel.org/intel-xe/a004736315d77837172418eb196d5b5f80b74e6c.camel@linux.intel.com/
> 
> Thanks, that one never made it into my inbox as far as I can see.
> 
> > A couple of questions on the design direction here:
> > 
> > IIRC both xe and i915 has checks to consider objects with a 0 gem
> > refcount as zombies requiring special treatment or skipping, when
> > encountered in TTM callbacks. We need to double-check that.
> 
> I think I've found all of those. The one in i915 were actually not
> TTM specific but try to catch the same problem on the GEM refcount.
> 
> > But I wonder, 
> > first this practice of resurrecting refcounts seem a bit unusual, I
> > wonder if we can get rid of that somehow?
> 
> I was also going back on forth if that is a good idea or not as well.
> 
> The usual solution to such kinds of issues is to use two reference
> counts, so that you got a multi stage cleanup approach. E.g. backing
> store and object, like what mm_struct is using as well.
> 
> The problem was simply that TTM/GEM ended up having *four* reference
> counts for the same object, each was doing something different and
> they didn't worked well together at all.
> 
> > Furthermore, it seems the problem with drm_exec is related only to
> > the
> > LRU walk. What about adding a struct completion to the object, that
> > is
> > signaled when the object has freed its final backing-store. The LRU
> > walk would then check if the object is a zombie, and if so just
> > wait on
> > the struct completion. (Need of course to carefully set up locking
> > orders). Then we wouldn't need to resurrect the gem refcount, nor
> > use
> > drm_exec locking for zombies.
> 
> I had a similar idea, waiting is already possible by waiting for the
> BOs work item.
> 
> But I abandoned that idea because I couldn't see how we could solve
> the locking.
> 
> > We would still need some form of refcounting while waiting on the
> > struct completion, but if we restricted the TTM refcount to *only*
> > be
> > used internally for that sole purpose, and also replaced the final
> > ttm_bo_put() with the ttm_bo_finalize() that you suggest we
> > wouldn't
> > need to resurrect that refcount since it wouldn't drop to zero
> > until
> > the object is ready for final free.
> > 
> > Ideas, comments?
> 
> Ideally I think we would use the handle_count as backing store the
> drm_gem_object->refcount as structure reference.
> 
> But that means a massive rework of the GEM handling/drivers/TTM.
> 
> Alternative we could just grab a reference to a unsignaled fence when
> we encounter a dead BO on the LRU.
> 
> What do you think of that idea?

I think to be able to *guarantee* exhaustive eviction, we need
1) all unfreed resources to sit on an LRU, and
2) everything on the LRU needs to be able to have something to wait
for.

A fence can't really guarantee 2), but it's close. There is a time-
interval in betwen where the last fence signals and we take the
resource from the LRU and free it.

A struct completion can be made to signal when the resource is freed.
I think the locking restriction in the struct completion case (the
struct completion is likely waited for under a dma-resv), is that
nothing except the object destructor may take an individualized resv of
a zombie gem object whose refcount has gone to zero. The destructor
should use an asserted trylock only to make lockdep happy. The struct
completion also needs a refcount to avoid destroying it while there are
waiters.

So what do you think about starting out with a fence, and if / when
that appears not to be sufficient, we have a backup plan to move to a
struct completion?

Thomas

> 
> Regards,
> Christian.