[PATCH drm-misc-next v8 09/12] drm/gpuvm: reference count drm_gpuvm structures

Christian König christian.koenig at amd.com
Thu Nov 9 16:03:37 UTC 2023


Am 09.11.23 um 16:50 schrieb Thomas Hellström:
> [SNIP]
>>>
> Did we get any resolution on this?
>
> FWIW, my take on this is that it would be possible to get GPUVM to 
> work both with and without internal refcounting; If with, the driver 
> needs a vm close to resolve cyclic references, if without that's not 
> necessary. If GPUVM is allowed to refcount in mappings and vm_bos, 
> that comes with a slight performance drop but as Danilo pointed out, 
> the VM lifetime problem iterating over a vm_bo's mapping becomes much 
> easier and the code thus becomes easier to maintain moving forward. 
> That convinced me it's a good thing.

I strongly believe you guys stumbled over one of the core problems with 
the VM here and I think that reference counting is the right answer to 
solving this.

The big question is that what is reference counted and in which 
direction does the dependency points, e.g. we have here VM, BO, BO_VM 
and Mapping objects.

Those patches here suggest a counted Mapping -> VM reference and I'm 
pretty sure that this isn't a good idea. What we should rather really 
have is a BO -> VM or BO_VM ->VM reference. In other words that each BO 
which is part of the VM keeps a reference to the VM.

BTW: At least in amdgpu we can have BOs which (temporary) doesn't have 
any mappings, but are still considered part of the VM.

>
> Another issue Christian brought up is that something intended to be 
> embeddable (a base class) shouldn't really have its own refcount. I 
> think that's a valid point. If you at some point need to derive from 
> multiple such structs each having its own refcount, things will start 
> to get weird. One way to resolve that would be to have the driver's 
> subclass provide get() and put() ops, and export a destructor for the 
> base-class, rather than to have the base-class provide the refcount 
> and a destructor  ops.

Well, I have never seen stuff like that in the kernel. Might be that 
this works, but I would rather not try if avoidable.

>
> That would also make it possible for the driver to decide the context 
> for the put() call: If the driver needs to be able to call put() from 
> irq / atomic context but the base-class'es destructor doesn't allow 
> atomic context, the driver can push freeing out to a work item if needed.
>
> Finally, the refcount overflow Christian pointed out. Limiting the 
> number of mapping sounds like a reasonable remedy to me.

Well that depends, I would rather avoid having a dependency for mappings.

Taking the CPU VM handling as example as far as I know vm_area_structs 
doesn't grab a reference to their mm_struct either. Instead they get 
automatically destroyed when the mm_struct is destroyed.

Which makes sense in that case because when the mm_struct is gone the 
vm_area_struct doesn't make sense any more either.

What we clearly need is a reference to prevent the VM or at least the 
shared resv to go away to early.

Regards,
Christian.

>
> But I think all of this is fixable as follow-ups if needed, unless I'm 
> missing something crucial.
>
> Just my 2 cents.
>
> /Thomas
>
>



More information about the dri-devel mailing list