Switching over to GEM refcounts and a bunch of cleanups

Thu Aug 21 14:59:56 UTC 2025

On 21.08.25 16:06, Thomas Hellström wrote:
>> What are you referring to?
> 
> https://lore.kernel.org/intel-xe/a004736315d77837172418eb196d5b5f80b74e6c.camel@linux.intel.com/

Thanks, that one never made it into my inbox as far as I can see.

> A couple of questions on the design direction here:
> 
> IIRC both xe and i915 has checks to consider objects with a 0 gem
> refcount as zombies requiring special treatment or skipping, when
> encountered in TTM callbacks. We need to double-check that.

I think I've found all of those. The ones in i915 were actually not TTM specific, but try to catch the same problem on the GEM refcount.

> But I wonder, 
> first this practice of resurrecting refcounts seem a bit unusual, I
> wonder if we can get rid of that somehow?

I was also going back and forth on whether that is a good idea or not.

The usual solution to this kind of issue is to use two reference counts, so that you get a multi-stage cleanup approach, e.g. one for the backing store and one for the object, like what mm_struct is using as well.
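
To illustrate the two-counter pattern in isolation, here is a minimal userspace sketch. It is not TTM/GEM code: struct obj, its fields and all obj_*() helpers are hypothetical names made up for this example, and it only loosely mirrors the mm_users/mm_count split in mm_struct. The "users" count keeps the backing store alive, the "count" keeps the structure alive, and all users together hold a single structure reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical object, not a real TTM/GEM structure. */
struct obj {
	atomic_int users;	/* references that need the backing store */
	atomic_int count;	/* references that only need the structure */
	void *backing;		/* stands in for the backing store */
	bool *freed;		/* demo only: set when the structure is freed */
};

static struct obj *obj_create(size_t size, bool *freed)
{
	struct obj *o = calloc(1, sizeof(*o));

	if (!o)
		return NULL;
	o->backing = malloc(size);
	if (!o->backing) {
		free(o);
		return NULL;
	}
	atomic_init(&o->users, 1);
	atomic_init(&o->count, 1);	/* held collectively by all users */
	o->freed = freed;
	return o;
}

static void obj_struct_get(struct obj *o)
{
	atomic_fetch_add(&o->count, 1);
}

/* Stage two: the structure goes away with the last structure reference. */
static void obj_struct_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->count, 1) == 1) {
		assert(!o->backing);	/* stage one must have run already */
		if (o->freed)
			*o->freed = true;
		free(o);
	}
}

/* Stage one: the backing store goes away with the last user. */
static void obj_user_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->users, 1) == 1) {
		free(o->backing);
		o->backing = NULL;
		obj_struct_put(o);	/* drop the users' shared reference */
	}
}

/* Never resurrects: fails once the user count has reached zero. */
static bool obj_user_tryget(struct obj *o)
{
	int v = atomic_load(&o->users);

	while (v != 0) {
		if (atomic_compare_exchange_weak(&o->users, &v, v + 1))
			return true;
	}
	return false;
}
```

With this split, something like an LRU walker would only hold a structure reference: it can still safely look at a zombie object and see that obj_user_tryget() fails, without ever bringing the user count back from zero. A real implementation would additionally need proper locking around the backing store pointer, which this sketch leaves out.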

The problem was simply that TTM/GEM ended up having *four* reference counts for the same object, each doing something different, and they didn't work well together at all.

> Furthermore, it seems the problem with drm_exec is related only to the
> LRU walk. What about adding a struct completion to the object, that is
> signaled when the object has freed its final backing-store. The LRU
> walk would then check if the object is a zombie, and if so just wait on
> the struct completion. (Need of course to carefully set up locking
> orders). Then we wouldn't need to resurrect the gem refcount, nor use
> drm_exec locking for zombies.

I had a similar idea; waiting is already possible by waiting for the BO's work item.

But I abandoned that idea because I couldn't see how we could solve the locking.

> We would still need some form of refcounting while waiting on the
> struct completion, but if we restricted the TTM refcount to *only* be
> used internally for that sole purpose, and also replaced the final
> ttm_bo_put() with the ttm_bo_finalize() that you suggest we wouldn't
> need to resurrect that refcount since it wouldn't drop to zero until
> the object is ready for final free.
> 
> Ideas, comments?

Ideally, I think we would use the handle_count as the backing store reference and the drm_gem_object->refcount as the structure reference.

But that means a massive rework of the GEM handling/drivers/TTM.

Alternatively, we could just grab a reference to an unsignaled fence when we encounter a dead BO on the LRU.

What do you think of that idea?

Regards,
Christian.