[PATCH v5 01/40] drm/gpuvm: Don't require obj lock in destructor path
Danilo Krummrich
dakr at kernel.org
Tue May 20 15:49:42 UTC 2025
On Tue, May 20, 2025 at 08:45:24AM -0700, Rob Clark wrote:
> On Tue, May 20, 2025 at 8:21 AM Danilo Krummrich <dakr at kernel.org> wrote:
> >
> > On Tue, May 20, 2025 at 07:57:36AM -0700, Rob Clark wrote:
> > > On Tue, May 20, 2025 at 12:23 AM Danilo Krummrich <dakr at kernel.org> wrote:
> > > > On Mon, May 19, 2025 at 10:51:24AM -0700, Rob Clark wrote:
> > > > > diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
> > > > > index f9eb56f24bef..1e89a98caad4 100644
> > > > > --- a/drivers/gpu/drm/drm_gpuvm.c
> > > > > +++ b/drivers/gpu/drm/drm_gpuvm.c
> > > > > @@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
> > > > >  	drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
> > > > >  	drm_gpuvm_bo_list_del(vm_bo, evict, lock);
> > > > >
> > > > > -	drm_gem_gpuva_assert_lock_held(obj);
> > > > > +	if (kref_read(&obj->refcount) > 0)
> > > > > +		drm_gem_gpuva_assert_lock_held(obj);
> > > >
> > > > Again, this is broken. What if the reference count drops to zero right after
> > > > the kref_read() check, but before drm_gem_gpuva_assert_lock_held() is called?
> > >
> > > No, it is not. If you find yourself having this race condition, then
> > > you already have bigger problems. There are only two valid cases when
> > > drm_gpuvm_bo_destroy() is called. Either:
> > >
> > > 1) You somehow hold a reference to the GEM object, in which case the
> > > refcount will be a positive integer. Maybe you race but on either
> > > side of the race you have a value that is greater than zero.
> > > 2) Or, you are calling this in the GEM object destructor path, in
> > > which case no one else should have a reference to the object, so it
> > > isn't possible to race
> >
> > What about:
> >
> > 3) You destroy the VM_BO, because the VM is destroyed, but someone else (e.g.
> > another VM) holds a reference to this BO, which is dropped concurrently?
>
> I mean, that is already broken, so I'm not sure what your point is?

No, it's not. In upstream GPUVM the last thing drm_gpuvm_bo_destroy() does is
call drm_gem_object_put(), because a struct drm_gpuvm_bo holds a reference to
the GEM object.

The above is only racy with your patch, which disables this reference count
and leaves this trap for the caller.
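
To illustrate the lifetime rule, here is a minimal single-threaded user-space
sketch. It is not the kernel code: every model_* name is a hypothetical
stand-in, the refcount is a plain int rather than a kref, and "freed" is only
a flag so the state can be inspected. It only shows why the object cannot
reach a refcount of zero while the VM_BO still holds its own reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a GEM object; not struct drm_gem_object. */
struct model_obj {
	int refcount;
	bool freed;	/* set instead of freeing, so the model stays inspectable */
};

/* Hypothetical stand-in for a VM_BO that owns a reference to the object. */
struct model_vm_bo {
	struct model_obj *obj;
};

static void model_obj_get(struct model_obj *obj)
{
	assert(obj->refcount > 0);
	obj->refcount++;
}

static void model_obj_put(struct model_obj *obj)
{
	assert(obj->refcount > 0);
	if (--obj->refcount == 0)
		obj->freed = true;
}

/* Creating the VM_BO takes a reference on the object. */
static struct model_vm_bo *model_vm_bo_create(struct model_obj *obj)
{
	struct model_vm_bo *vm_bo = malloc(sizeof(*vm_bo));

	if (!vm_bo)
		return NULL;

	model_obj_get(obj);
	vm_bo->obj = obj;
	return vm_bo;
}

/*
 * Destroying the VM_BO drops that reference as the very last step.
 * Returns the refcount observed before the final put; it is at least 1
 * because the VM_BO's own reference is still held at that point.
 */
static int model_vm_bo_destroy(struct model_vm_bo *vm_bo)
{
	struct model_obj *obj = vm_bo->obj;
	int seen = obj->refcount;

	free(vm_bo);
	model_obj_put(obj);
	return seen;
}
```

In this model, another holder dropping its reference at any time before
model_vm_bo_destroy() cannot free the object, which is the property a weak
reference gives up.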
>
> This patch is specifically about the case where VMAs are torn down in
> gem->free_object().
>
> BR,
> -R
>
> > Please don't tell me "but MSM doesn't do that". This is generic infrastructure,
> > it is perfectly valid for drivers to do that.
> >
> > > If the refcount drops to zero after the check, you are about to blow
> > > up regardless.
> >
> > Exactly, that's why the whole approach of removing the reference count a VM_BO
> > has on the BO, i.e. the proposed DRM_GPUVM_VA_WEAK_REF is broken.
> >
> > As mentioned, make it DRM_GPUVM_MSM_LEGACY_QUIRK and get an approval from Dave /
> > Sima for it.
> >
> > You can't make DRM_GPUVM_VA_WEAK_REF work as a generic thing without breaking
> > the whole design and lifetimes of GPUVM.
> >
> > We'd just end up with tons of traps for drivers with lots of WARN_ON() paths and
> > footguns like the one above if a driver works slightly differently than MSM.