[Intel-gfx] [PATCH 7/9] drm/i915/gt: Fix memory leaks in per-gt sysfs

Tue May 3 04:29:10 UTC 2022

On Sun, 01 May 2022 23:22:02 -0700, Andrzej Hajda wrote:
> On 29.04.2022 06:25, Dixit, Ashutosh wrote:
> > On Thu, 28 Apr 2022 07:36:14 -0700, Andrzej Hajda wrote:
> >> See [1], it is quite old, so maybe it is not valid anymore, but I see no
> >> code proving sth has changed.
> > Hi Andrzej,
> >
> > A lot has changed since that article from 2003 (for 2.5 kernel). For
> > instance there is kernfs (as I mention above):
> >
> >	https://lwn.net/Articles/571590/
> >
> > A process having a sysfs file open today in my view will result in the
> > following:
> > * It will take a reference on kernfs_node (not on kobject as was the case
> >    in kernel 2.5 in [1])
> > * An open file will prevent the module from being unloaded (not the kernel
> >    crashing as in 2.5 in [1])
>
> Thats nice, but kernfs_node->priv still points to kobject so their
> lifetimes are bounded.

Yes, of course there has to be some connection between the kernfs and kobject.

> > So this is what I would expect with today's kernel. I am not seeing
> > anything we've done here which violates anything in [1] or [2].
> >> Also current doc says also [2] similar things, especially:
> >> "Once you registered your kobject via kobject_add(), you must never use
> >> kfree() to free it directly"
> > Correct, we are using kobject_put(), not kfree'ing the kobject.
>
> That I wouldn't agree. kobject_put is called, then the object in which
> kobject is embedded is kfree'd somewhere later on driver removal, without
> awareness of this kobject.  According to your analysis it should have 0
> refs, but this is analysis of the current code, even if it is true now it
> could change in the future.

Yes but we cannot anticipate all changes which can happen in the future,
(though we should handle and make any changes which we can anticipate at
present). This is also basically what the kernel philosophy is, don't make
unnecessary generalizations and try to handle unforseen situations which
can happen in the future.

Let me add some explanations about the patch before addressing your next
point.

1. We are adding 'struct kobject sysfs_gt' to 'struct intel_gt'. We are
   adding the kobject directly, not pointer to kobject. This allows us to
   "reach" 'struct intel_gt' from the kobject using a simple container_of:
   see kobj_to_gt().

2. Because the kobject is not kmalloc'd it cannot be kfree'd so the release
   method has to be empty (or NULL). 'struct intel_gt' is kmalloc'd
   separately elsewhere and memory for the kobject will be freed as part of
   intel_gt.

3. To provide a NULL or empty release method we need to provide a 'struct
   kobj_type kobj_gt_type' associated with the sysfs_gt kobject. This works
   nicely because we were anyway need one for .default_groups (we may add
   other attributes to 'id_groups' in the future). Note that the kobject is
   initialized and added to sysfs using kobject_init_and_add().

4. The only reason for providing an empty release method rather than a NULL
   release method is the following pr_debug in kobject_cleanup():

           if (t && !t->release)
                pr_debug("kobject: '%s' (%p): does not have a release() function, it is broken and must be fixed. See Documentation/core-api/kobject.rst.\n",
                         kobject_name(kobj), kobj);

  This statement could possibly be removed because the release method is
  not needed in the case I just described above, maybe I'll send a patch to
  suggest removing it. Though I think what they will say is that since NULL
  release methods are uncommon maybe just provide an empty release method
  when you need a NULL release method (which is what I have done in the
  patch).

  Also note that, as described below, there are several other cases in the
  kernel which either have NULL or an empty release methods. See below.

> And IMO it is against docs[2]:
> - "One important point cannot be overstated: every kobject must have a
> release() method, and the kobject must persist (in a consistent state)
> until that method is called. If these constraints are not met, the code is
> flawed." - empty release method means clearly it is against the docs.
> -"The end result is that a structure protected by a kobject cannot be freed
> before its reference count goes to zero. The reference count is not under
> the direct control of the code which created the kobject.".
>
> So either docs and part of kobject code were not updated to reflect
> changes you are assuming, either your assumption is incorrect.

In my view the doc is a general introduction to kobjects and simplifies
things. As shown below there are numerous examples in the kernel of both
NULL and empty release methods. I just went with the empty method because
of the reason mentioned above.

> Looking at other users of kobject it seems they follow docs, their
> release method either frees memory directly either kref_put on containing
> struct, it was just quick scan so I could overlooked sth.
>
> >> [1]: https://lwn.net/Articles/36850/
> >> [2]:
> >> https://elixir.bootlin.com/linux/v5.18-rc4/source/Documentation/core-api/kobject.rst#L246

From my comments above, the trick to finding NULL or empty methods is to
search for the following construct:

        kobject_init_and_add(&xyz, ...)

If we do this we find the following:

NULL release method:
	procfs_queue_type
	integrity_ktype
	cppc_ktype
	acpi_hotplug_profile_ktype
	rnbd_dev_ktype
	ioat_ktype
	ab8500_fg_ktype

Empty release method:
	blk_ia_range_sysfs_nop_release()

I hope these examples should be sufficient to show that the release method
can be both NULL or empty.

So I still haven't found any reason to make changes to the v2 patch which I
have previously shown works correctly and without issues.

Thanks.
--
Ashutosh