[PATCH v3 10/12] drm/amdgpu: Avoid sysfs dirs removal post device unplug

Daniel Vetter daniel at ffwll.ch
Fri Nov 27 15:04:22 UTC 2020


On Wed, Nov 25, 2020 at 12:39:47PM -0500, Andrey Grodzovsky wrote:
> 
> On 11/25/20 4:04 AM, Daniel Vetter wrote:
> > On Tue, Nov 24, 2020 at 11:27 PM Andrey Grodzovsky
> > <Andrey.Grodzovsky at amd.com> wrote:
> > > 
> > > On 11/24/20 9:49 AM, Daniel Vetter wrote:
> > > > On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote:
> > > > > Avoids NULL ptr due to kobj->sd being unset on device removal.
> > > > > 
> > > > > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 4 +++-
> > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++-
> > > > >    2 files changed, 6 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > > > index caf828a..812e592 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > > > > @@ -27,6 +27,7 @@
> > > > >    #include <linux/uaccess.h>
> > > > >    #include <linux/reboot.h>
> > > > >    #include <linux/syscalls.h>
> > > > > +#include <drm/drm_drv.h>
> > > > > 
> > > > >    #include "amdgpu.h"
> > > > >    #include "amdgpu_ras.h"
> > > > > @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev)
> > > > >               .attrs = attrs,
> > > > >       };
> > > > > 
> > > > > -    sysfs_remove_group(&adev->dev->kobj, &group);
> > > > > +    if (!drm_dev_is_unplugged(&adev->ddev))
> > > > > +            sysfs_remove_group(&adev->dev->kobj, &group);
> > > > This looks wrong. sysfs, like any other interface, should be
> > > > unconditionally thrown out when we do the drm_dev_unregister. Whether
> > > > hotunplugged or not should matter at all. Either this isn't needed at all,
> > > > or something is wrong with the ordering here. But definitely fishy.
> > > > -Daniel
> > > 
> > > So technically this is needed because kobejct's sysfs directory entry kobj->sd
> > > is set to NULL
> > > on device removal (from sysfs_remove_dir) but because we don't finalize the device
> > > until last reference to drm file is dropped (which can happen later) we end up
> > > calling sysfs_remove_file/dir after
> > > this pointer is NULL. sysfs_remove_file checks for NULL and aborts while
> > > sysfs_remove_dir
> > > is not and that why I guard against calls to sysfs_remove_dir.
> > > But indeed the whole approach in the driver is incorrect, as Greg pointed out -
> > > we should use
> > > default groups attributes instead of explicit calls to sysfs interface and this
> > > would save those troubles.
> > > But again. the issue here of scope of work, converting all of amdgpu to default
> > > groups attributes is somewhat
> > > lengthy process with extra testing as the entire driver is papered with sysfs
> > > references and seems to me more of a standalone
> > > cleanup, just like switching to devm_ and drmm_ work. To me at least it seems
> > > that it makes more sense
> > > to finalize and push the hot unplug patches so that this new functionality can
> > > be part of the driver sooner
> > > and then incrementally improve it by working on those other topics. Just as
> > > devm_/drmm_ I also added sysfs cleanup
> > > to my TODO list in the RFC patch.
> > Hm, whether you solve this with the default group stuff to
> > auto-remove, or remove explicitly at the right time doesn't matter
> > much. The underlying problem you have here is that it's done way too
> > late.
> 
> As far as I understood correctly the default group attrs by reading this
> article by Greg - https://www.linux.com/news/how-create-sysfs-file-correctly/
> it will be removed together with the device and not too late like now and I quote
> from the last paragraph there:
> 
> "By setting this value, you don’t have to do anything in your
> probe() or release() functions at all in order for the
> sysfs files to be properly created and destroyed whenever your
> device is added or removed from the system. And you will, most
> importantly, do it in a race-free manner, which is always a good thing."
> 
> To me this seems like the best solution to the late remove issue. What do
> you think ?
> 
> 
> >   sysfs removal (like all uapi interfaces) need to be removed as
> > part of drm_dev_unregister.
> 
> 
> Do you mean we need to trace and aggregate all sysfs files creation within
> the low level drivers and then call some sysfs release function inside
> drm_dev_unregister
> to iterate and release them all ?

That would just reinvent the proper solution Greg explained above. For now
I think you just need some driver callback that you call right after
drm_dev_unplug (or drm_dev_unregister) to clean up these sysfs interfaces.
Afaiui the important part is to clean up your additional interfaces from
the ->remove callback, since at that point the core sysfs stuff still
exists.

Maybe you want to do another loop over all IP blocks and a ->unregister
callback, or maybe it's just 1-2 cases you call directly.

> >   I guess aside from the split into fini_hw
> > and fini_sw, you also need an unregister_late callback (like we have
> > already for drm_connector, so that e.g. backlight and similar stuff
> > can be unregistered).
> 
> 
> Is this the callback you suggest to call from within drm_dev_unregister and
> it will be responsible to release all sysfs files created within the driver ?

Nah that would be an amdgpu ip block callback (forgot what it's called,
too comfy to fire up an editor right now and look it up, but you have a
bunch of these loops all over).

I think the core solution we want is what Greg already laid out. This idea
here was just an amdgpu interim plan, if the core solution is a bit too
invasive to implement right away.
-Daniel

> 
> Andrey
> 
> 
> > 
> > Papering over the underlying bug like this doesn't really fix much,
> > the lifetimes are still wrong.
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > > >       return 0;
> > > > >    }
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> > > > > index 2b7c90b..54331fc 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> > > > > @@ -24,6 +24,7 @@
> > > > >    #include <linux/firmware.h>
> > > > >    #include <linux/slab.h>
> > > > >    #include <linux/module.h>
> > > > > +#include <drm/drm_drv.h>
> > > > > 
> > > > >    #include "amdgpu.h"
> > > > >    #include "amdgpu_ucode.h"
> > > > > @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev)
> > > > > 
> > > > >    void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev)
> > > > >    {
> > > > > -    sysfs_remove_group(&adev->dev->kobj, &fw_attr_group);
> > > > > +    if (!drm_dev_is_unplugged(&adev->ddev))
> > > > > +            sysfs_remove_group(&adev->dev->kobj, &fw_attr_group);
> > > > >    }
> > > > > 
> > > > >    static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev,
> > > > > --
> > > > > 2.7.4
> > > > > 
> > 
> > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


More information about the amd-gfx mailing list