[PATCH v3 10/12] drm/amdgpu: Avoid sysfs dirs removal post device unplug

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Fri Nov 27 15:34:34 UTC 2020


On 11/27/20 10:04 AM, Daniel Vetter wrote:
> On Wed, Nov 25, 2020 at 12:39:47PM -0500, Andrey Grodzovsky wrote:
>> On 11/25/20 4:04 AM, Daniel Vetter wrote:
>>> On Tue, Nov 24, 2020 at 11:27 PM Andrey Grodzovsky
>>> <Andrey.Grodzovsky at amd.com> wrote:
>>>> On 11/24/20 9:49 AM, Daniel Vetter wrote:
>>>>> On Sat, Nov 21, 2020 at 12:21:20AM -0500, Andrey Grodzovsky wrote:
>>>>>> Avoids NULL ptr due to kobj->sd being unset on device removal.
>>>>>>
>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   | 4 +++-
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 4 +++-
>>>>>>     2 files changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>>> index caf828a..812e592 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>>> @@ -27,6 +27,7 @@
>>>>>>     #include <linux/uaccess.h>
>>>>>>     #include <linux/reboot.h>
>>>>>>     #include <linux/syscalls.h>
>>>>>> +#include <drm/drm_drv.h>
>>>>>>
>>>>>>     #include "amdgpu.h"
>>>>>>     #include "amdgpu_ras.h"
>>>>>> @@ -1043,7 +1044,8 @@ static int amdgpu_ras_sysfs_remove_feature_node(struct amdgpu_device *adev)
>>>>>>                .attrs = attrs,
>>>>>>        };
>>>>>>
>>>>>> -    sysfs_remove_group(&adev->dev->kobj, &group);
>>>>>> +    if (!drm_dev_is_unplugged(&adev->ddev))
>>>>>> +            sysfs_remove_group(&adev->dev->kobj, &group);
>>>>> This looks wrong. sysfs, like any other interface, should be
>>>>> unconditionally thrown out when we do the drm_dev_unregister. Whether
>>>>> hotunplugged or not should matter at all. Either this isn't needed at all,
>>>>> or something is wrong with the ordering here. But definitely fishy.
>>>>> -Daniel
>>>> So technically this is needed because kobejct's sysfs directory entry kobj->sd
>>>> is set to NULL
>>>> on device removal (from sysfs_remove_dir) but because we don't finalize the device
>>>> until last reference to drm file is dropped (which can happen later) we end up
>>>> calling sysfs_remove_file/dir after
>>>> this pointer is NULL. sysfs_remove_file checks for NULL and aborts while
>>>> sysfs_remove_dir
>>>> is not and that why I guard against calls to sysfs_remove_dir.
>>>> But indeed the whole approach in the driver is incorrect, as Greg pointed out -
>>>> we should use
>>>> default groups attributes instead of explicit calls to sysfs interface and this
>>>> would save those troubles.
>>>> But again. the issue here of scope of work, converting all of amdgpu to default
>>>> groups attributes is somewhat
>>>> lengthy process with extra testing as the entire driver is papered with sysfs
>>>> references and seems to me more of a standalone
>>>> cleanup, just like switching to devm_ and drmm_ work. To me at least it seems
>>>> that it makes more sense
>>>> to finalize and push the hot unplug patches so that this new functionality can
>>>> be part of the driver sooner
>>>> and then incrementally improve it by working on those other topics. Just as
>>>> devm_/drmm_ I also added sysfs cleanup
>>>> to my TODO list in the RFC patch.
>>> Hm, whether you solve this with the default group stuff to
>>> auto-remove, or remove explicitly at the right time doesn't matter
>>> much. The underlying problem you have here is that it's done way too
>>> late.
>> As far as I understood correctly the default group attrs by reading this
>> article by Greg - https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.linux.com%2Fnews%2Fhow-create-sysfs-file-correctly%2F&data=04%7C01%7CAndrey.Grodzovsky%40amd.com%7C3e993d1dfad746ffff2608d892e5bbb9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637420862696611997%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HAlEqI6CYR3k1n9FFAibpjBlK7I7x9W23yd5CWJVYgM%3D&reserved=0
>> it will be removed together with the device and not too late like now and I quote
>> from the last paragraph there:
>>
>> "By setting this value, you don’t have to do anything in your
>> probe() or release() functions at all in order for the
>> sysfs files to be properly created and destroyed whenever your
>> device is added or removed from the system. And you will, most
>> importantly, do it in a race-free manner, which is always a good thing."
>>
>> To me this seems like the best solution to the late remove issue. What do
>> you think ?
>>
>>
>>>    sysfs removal (like all uapi interfaces) need to be removed as
>>> part of drm_dev_unregister.
>>
>> Do you mean we need to trace and aggregate all sysfs files creation within
>> the low level drivers and then call some sysfs release function inside
>> drm_dev_unregister
>> to iterate and release them all ?
> That would just reinvent the proper solution Greg explained above. For now
> I think you just need some driver callback that you call right after
> drm_dev_unplug (or drm_dev_unregister) to clean up these sysfs interfaces.
> Afaiui the important part is to clean up your additional interfaces from
> the ->remove callback, since at that point the core sysfs stuff still
> exists.
>
> Maybe you want to do another loop over all IP blocks and a ->unregister
> callback, or maybe it's just 1-2 cases you call directly.


Most of them are barried within non ip block entites (e.g
amdgpu_device_fini->amdgpu_atombios_fini->amdgpu_atombios_fini->device_remove_file->sysfs_remove_file)
or much longer chain in kfd, like 
amdgpu_device_fini->.....kfd_remove_sysfs_node_entry->kfd_remove_sysfs_file->sysfs_remove_file
and so they will will need to be accessed explicitly by creating some accessors 
functions in their public API in multiple layers.


>
>>>    I guess aside from the split into fini_hw
>>> and fini_sw, you also need an unregister_late callback (like we have
>>> already for drm_connector, so that e.g. backlight and similar stuff
>>> can be unregistered).
>>
>> Is this the callback you suggest to call from within drm_dev_unregister and
>> it will be responsible to release all sysfs files created within the driver ?
> Nah that would be an amdgpu ip block callback (forgot what it's called,
> too comfy to fire up an editor right now and look it up, but you have a
> bunch of these loops all over).
>
> I think the core solution we want is what Greg already laid out. This idea
> here was just an amdgpu interim plan, if the core solution is a bit too
> invasive to implement right away.
> -Daniel


By what I showed above to me it looks that the interim solution might be not 
really less invasive then the right
solution by Greg and so if you feel that this is a blocker for the entire patch 
set  and we absolutely can't live
with the temporary band aid which this patch represents then I will just do the 
real solution as a standalone patch set
because I think this one is a big enough change on it's own to combine it with 
the hot device unplug topic.

Andrey


>
>> Andrey
>>
>>
>>> Papering over the underlying bug like this doesn't really fix much,
>>> the lifetimes are still wrong.
>>> -Daniel
>>>
>>>> Andrey
>>>>
>>>>
>>>>>>        return 0;
>>>>>>     }
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
>>>>>> index 2b7c90b..54331fc 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
>>>>>> @@ -24,6 +24,7 @@
>>>>>>     #include <linux/firmware.h>
>>>>>>     #include <linux/slab.h>
>>>>>>     #include <linux/module.h>
>>>>>> +#include <drm/drm_drv.h>
>>>>>>
>>>>>>     #include "amdgpu.h"
>>>>>>     #include "amdgpu_ucode.h"
>>>>>> @@ -464,7 +465,8 @@ int amdgpu_ucode_sysfs_init(struct amdgpu_device *adev)
>>>>>>
>>>>>>     void amdgpu_ucode_sysfs_fini(struct amdgpu_device *adev)
>>>>>>     {
>>>>>> -    sysfs_remove_group(&adev->dev->kobj, &fw_attr_group);
>>>>>> +    if (!drm_dev_is_unplugged(&adev->ddev))
>>>>>> +            sysfs_remove_group(&adev->dev->kobj, &fw_attr_group);
>>>>>>     }
>>>>>>
>>>>>>     static int amdgpu_ucode_init_single_fw(struct amdgpu_device *adev,
>>>>>> --
>>>>>> 2.7.4
>>>>>>
>>>


More information about the amd-gfx mailing list