drmm vs devm (was Re: [PATCH 2/8] drm/xe: covert sysfs over to devm)

Daniel Vetter daniel at ffwll.ch
Tue Apr 30 13:29:57 UTC 2024


Adding dri-devel because this is kinda more important.

On Mon, Apr 29, 2024 at 04:28:42PM -0500, Lucas De Marchi wrote:
> On Mon, Apr 29, 2024 at 02:45:26PM GMT, Rodrigo Vivi wrote:
> > On Mon, Apr 29, 2024 at 04:17:54PM +0100, Matthew Auld wrote:
> > > On 29/04/2024 14:52, Lucas De Marchi wrote:
> > > > On Mon, Apr 29, 2024 at 09:28:00AM GMT, Rodrigo Vivi wrote:
> > > > > On Mon, Apr 29, 2024 at 01:14:38PM +0100, Matthew Auld wrote:
> > > > > > Hotunplugging the device seems to result in stuff like:
> > > > > >
> > > > > > kobject_add_internal failed for tile0 with -EEXIST, don't try to
> > > > > > register things with the same name in the same directory.
> > > > > >
> > > > > > We only remove the sysfs as part of drmm, however that is tied to the
> > > > > > lifetime of the driver instance and not the device underneath. Attempt
> > > > > > to fix by using devm for all of the remaining sysfs stuff related to the
> > > > > > device.
> > > > >
> > > > > hmmm... so basically we should use the drmm only for the global module
> > > > > stuff and the devm for things that are per device?
> > > >
> > > > that doesn't make much sense. drmm is supposed to run when the driver
> > > > unbinds from the device... basically when all refcounts are gone with
> > > > drm_dev_put().  Are we keeping a ref we shouldn't?
> > > 
> > > It's run when all refcounts are dropped for that particular drm_device, but
> > > that is separate from the physical device underneath (struct device). For
> > > example if something has an open driver fd the drmm release action is not
> > > going to be called until after that is also closed. But in the meantime we
> > > might have already removed the pci device and re-attached it to a newly
> > > allocated drm_device/xe_driver instance, like with hotunplug.
> > > 
> > > For example, currently we don't even call basic stuff like guc_fini() etc.
> > > when removing the pci device, but rather when the drm_device is released,
> > > which sounds quite broken.
> > > 
> > > So roughly drmm is for drm_device software level stuff and devm is for stuff
> > > that needs to happen when removing the device. See also the doc for drmm:
> > > https://elixir.bootlin.com/linux/v6.8-rc1/source/drivers/gpu/drm/drm_managed.c#L23
> > > 
> > > Also: https://docs.kernel.org/gpu/drm-uapi.html#device-hot-unplug
> 
> yeah... I think you convinced me

So rule of thumb:

- devm is for hardware stuff, so like removing pci mmaps, releasing
  interrupt handlers, cleaning up anything hw related. Because after devm
  cleanup, i.e. driver unbind, all that stuff is gone, _even_ when you
  hold onto a struct device reference. Because all that struct device
  reference guarantees is that the software structure stays around as a
  valid memory reference.

- devm is also for removing uapi. Unfortunately we're not quite at the
  world where devm_drm_dev_register is possible, because on the unload
  side that must be done first, and there are still a few things drivers
  need to do after that which aren't fully devm/drmm-ified.

- drmm is for anything software related, so data structures and stuff like
  that. If you have a devm_kmalloc, you very, very likely have a bug. This
  is where you tear down all your software data structures, which means if
  you have that interleaved with the hw teardown in e.g. guc_fini you have
  some serious work cut out for you. drmm stuff is tied to the lifetime of
  the drm_device, the core drm uapi interface object, which might stick
  around for much longer than drm_dev_unregister (e.g. while userspace
  still holds an open fd).

- Finally, when going from the sw side to hw side you must wrap such
  access with drm_dev_enter/exit, or you have races. This is also where
  using drmm and devm for everything really helps, because it gives you a
  very strong hint when you're going from the sw world to the hw world.

  As an example, all the callbacks on the various kms objects are in the
  sw world (so need to be cleaned up with drmm), but the moment you access
  hw (e.g. any mmio) you need to protect that with drm_dev_enter/exit.
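
To make the two lifetimes in the list above concrete, here is a small
userspace toy model. It is not kernel code and does not use the real devm
or drmm APIs: every toy_* name is hypothetical, and the only behaviour it
assumes is what this thread describes (devm cleanup runs at driver unbind,
drmm cleanup runs when the last drm_device reference is dropped), plus
cleanup running in reverse registration order.

```c
#include <assert.h>

/* Toy userspace model, NOT kernel code: every toy_* name is hypothetical. */

#define TOY_MAX_ACTIONS 8

enum toy_event {
	TOY_UNMAP_MMIO,
	TOY_REMOVE_SYSFS,
	TOY_FREE_SW_STATE,
};

struct toy_log {
	enum toy_event events[TOY_MAX_ACTIONS];
	int len;
};

/* Global log so a caller can inspect the order in which cleanup ran. */
struct toy_log toy_log;

typedef void (*toy_action_fn)(void);

struct toy_action_list {
	toy_action_fn fn[TOY_MAX_ACTIONS];
	int count;
};

struct toy_drm_device {
	int refcount;                 /* stands in for drm_device refcounting */
	struct toy_action_list devm;  /* hw/uapi cleanup, runs at unbind */
	struct toy_action_list drmm;  /* sw cleanup, runs at final put */
};

static void toy_record(enum toy_event ev)
{
	assert(toy_log.len < TOY_MAX_ACTIONS);
	toy_log.events[toy_log.len++] = ev;
}

static void toy_unmap_mmio(void)    { toy_record(TOY_UNMAP_MMIO); }
static void toy_remove_sysfs(void)  { toy_record(TOY_REMOVE_SYSFS); }
static void toy_free_sw_state(void) { toy_record(TOY_FREE_SW_STATE); }

static void toy_add_action(struct toy_action_list *list, toy_action_fn fn)
{
	assert(list->count < TOY_MAX_ACTIONS);
	list->fn[list->count++] = fn;
}

/* Run cleanup actions in reverse order of registration. */
static void toy_run_actions(struct toy_action_list *list)
{
	while (list->count > 0)
		list->fn[--list->count]();
}

/* Drop one reference; sw cleanup only runs when the last one goes away. */
static void toy_put(struct toy_drm_device *dev)
{
	if (--dev->refcount == 0)
		toy_run_actions(&dev->drmm);
}

/* Driver unbind: hw/uapi cleanup runs now, then the driver drops its ref. */
static void toy_unbind(struct toy_drm_device *dev)
{
	toy_run_actions(&dev->devm);
	toy_put(dev);
}

/* Returns how many cleanup actions had run right after unbind. */
int toy_hotunplug_demo(void)
{
	struct toy_drm_device dev = { .refcount = 1 }; /* driver's own ref */
	int ran_at_unbind;

	toy_log.len = 0;
	toy_add_action(&dev.devm, toy_unmap_mmio);
	toy_add_action(&dev.devm, toy_remove_sysfs);
	toy_add_action(&dev.drmm, toy_free_sw_state);

	dev.refcount++;              /* userspace opens an fd */
	toy_unbind(&dev);            /* hotunplug while the fd is still open */
	ran_at_unbind = toy_log.len;

	toy_put(&dev);               /* userspace finally closes the fd */
	return ran_at_unbind;
}
```

In this model the sysfs removal happens at unbind, so a new driver
instance could register the same sysfs names right away, while the sw
state stays valid until the open fd is closed.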

Using devm for everything means you have a use-after-free on the sw side,
otoh using drmm for everything means you have a use-after-free on the hw
side (like a physical hotunplug might reallocate your mmio range to
another Thunderbolt device that has been plugged in meanwhile).
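
The hw-side hazard can be sketched with another userspace toy model of the
drm_dev_enter/exit guard. Again this is not kernel code and all toy_*
names are hypothetical. The real drm_dev_enter()/drm_dev_exit() pair is
SRCU-based, so that drm_dev_unplug() can wait for in-flight critical
sections to finish; this single-threaded sketch leaves that out and only
shows the shape of the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model, NOT kernel code: every toy_* name is hypothetical. */

struct toy_device {
	bool unplugged;           /* set once the device has been unbound */
	volatile uint32_t *mmio;  /* NULL after the hw side was torn down */
};

/* Stand-in for drm_dev_enter(): false means "do not touch the hw". */
static bool toy_dev_enter(const struct toy_device *dev)
{
	return !dev->unplugged;
}

/* Stand-in for drm_dev_exit(): nothing to release in this toy model. */
static void toy_dev_exit(const struct toy_device *dev)
{
	(void)dev;
}

/* Stand-in for unbind: mark the device gone and drop the mapping. */
static void toy_unplug(struct toy_device *dev)
{
	dev->unplugged = true;
	dev->mmio = NULL;
}

/*
 * A sw-world callback that wants to read a register. It may still be
 * called after unplug (e.g. through an open fd), so the mmio access is
 * wrapped in the guard. Returns 0 on success, -1 if the device is gone.
 */
static int toy_read_reg(struct toy_device *dev, size_t index, uint32_t *out)
{
	assert(out != NULL);

	if (!toy_dev_enter(dev))
		return -1;

	*out = dev->mmio[index];
	toy_dev_exit(dev);
	return 0;
}

/* Returns 0 if the guard behaved as expected before and after unplug. */
int toy_guard_demo(void)
{
	uint32_t fake_regs[4] = { 0x1234, 0, 0, 0 };
	struct toy_device dev = { .unplugged = false, .mmio = fake_regs };
	uint32_t val = 0;
	int before, after;

	before = toy_read_reg(&dev, 0, &val);  /* device present: reads 0x1234 */
	toy_unplug(&dev);
	after = toy_read_reg(&dev, 0, &val);   /* device gone: mmio not touched */

	return (before == 0 && val == 0x1234 && after == -1) ? 0 : 1;
}
```

Without the guard, the second read would dereference a mapping that no
longer belongs to this driver, which is the hw-side use-after-free.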

It's definitely big time fun all around :-/

Oh also, please help improve the docs on this stuff, I'm trying to make
sure it's all there and ideally the various pieces link to the other
parts, but it's tricky and I understand this stuff too well to spot the
gaps ...

Cheers, Sima

> 
> > 
> > Cc: Aravind and Michal since this likely relates to the FLR discussion...
> > 
> > but it looks to me that we should move more towards the devm_ and limit
> > the usage of drmm_ to some very specific cases...
> 
> agreed,
> 
> Lucas De Marchi
> 
> > 
> > > 
> > > >
> > > > Lucas De Marchi

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

