[PATCH] drm/amdgpu: fix ras calltrace problem by sysfs_update_group

Chen, Guchun Guchun.Chen at amd.com
Thu Sep 24 13:47:34 UTC 2020


[AMD Public Use]

Sure, Christian. I will send patch v2 to correct the commit message.

Regards,
Guchun

-----Original Message-----
From: Christian König <ckoenig.leichtzumerken at gmail.com> 
Sent: Thursday, September 24, 2020 9:31 PM
To: Chen, Guchun <Guchun.Chen at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org; Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix ras calltrace problem by sysfs_update_group

Well taking a closer look I think that "31 insertions(+), 56 deletions(-)" makes it clear that this is a cleanup.

So you need to explain that in the commit message as well and not focus on that it fixes an older kernel.

Regards,
Christian.

Am 24.09.20 um 12:32 schrieb Chen, Guchun:
> [AMD Public Use]
>
> Hi Kristian,
>
> Thanks for your review. This is not one workaround, it's one valid fix, I assume. It takes another approach to achieve our goal.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Thursday, September 24, 2020 6:29 PM
> To: Chen, Guchun <Guchun.Chen at amd.com>; amd-gfx at lists.freedesktop.org; 
> Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis 
> <Dennis.Li at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John 
> <John.Clements at amd.com>; Deucher, Alexander 
> <Alexander.Deucher at amd.com>; Lazar, Lijo <Lijo.Lazar at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix ras calltrace problem by 
> sysfs_update_group
>
> Am 24.09.20 um 12:21 schrieb Guchun Chen:
>> Sysfs_update_group brings below calltrace problem with kernel 3.10 on 
>> RHEL7.9. The cause is the bug of sysfs_update_group on kernel 3.10, 
>> which always fail on one named group, as next calling 
>> internal_create_group will try to create one new sysfs dir 
>> unconditionally, so system complains one duplicate creation.
> NAK, we should not have workarounds for older kernels in the upstream code base. You somehow need to handle this in the DKMS package.
>
> Or is that also valid as a stand alone cleanup?
>
> Regards,
> Christian.
>
>> The patch is to merge the sysfs setup together by calling 
>> sysfs_create_group once, though the nodes would vary on top of different configurations.
>>
>> [    6.531591] WARNING: CPU: 52 PID: 638 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
>> [    6.531592] sysfs: cannot create duplicate filename '/devices/pci0000:20/0000:20:03.1/0000:21:00.0/0000:22:00.0/0000:23:00.0/ras'
>> [    6.531593] Modules linked in: amdgpu(OE+) amd_iommu_v2 amd_sched(OE) amdttm(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm libahci igb libata crct10dif_pclmul crct10dif_common crc32c_intel ptp nvme atlantic pps_core dca drm_panel_orientation_quirks nvme_core i2c_algo_bit sdhci_acpi iosf_mbi sdhci mmc_core dm_mirror dm_region_hash dm_log dm_mod fuse
>> [    6.531606] CPU: 52 PID: 638 Comm: systemd-udevd Tainted: G           OE  ------------   3.10.0-1152.el7.x86_64 #1
>> [    6.531609] Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40 AORUS MASTER, BIOS F5c 03/05/2020
>> [    6.531610] Call Trace:
>> [    6.531615]  [<ffffffff9b18133a>] dump_stack+0x19/0x1b
>> [    6.531618]  [<ffffffff9aa9b228>] __warn+0xd8/0x100
>> [    6.531619]  [<ffffffff9aa9b2af>] warn_slowpath_fmt+0x5f/0x80
>> [    6.531621]  [<ffffffff9acd8e48>] ? kernfs_path+0x48/0x60
>> [    6.531622]  [<ffffffff9acdbb54>] sysfs_warn_dup+0x64/0x80
>> [    6.531624]  [<ffffffff9acdc6ba>] internal_create_group+0x23a/0x250
>> [    6.531625]  [<ffffffff9acdc706>] sysfs_update_group+0x16/0x20
>> [    6.531660]  [<ffffffffc067fb67>] amdgpu_ras_init+0x1e7/0x240 [amdgpu]
>>      6.531691]  [<ffffffffc063dc7c>] amdgpu_device_init+0xf9c/0x1cb0 [amdgpu]
>> [    6.531694]  [<ffffffff9abe5608>] ? kmalloc_order+0x18/0x40
>> [    6.531698]  [<ffffffff9ac24326>] ? kmalloc_order_trace+0x26/0xa0
>> [    6.531726]  [<ffffffffc0642b1a>] amdgpu_driver_load_kms+0x5a/0x330 [amdgpu]
>> [    6.531753]  [<ffffffffc063a832>] amdgpu_pci_probe+0x172/0x280 [amdgpu]
>> [    6.531757]  [<ffffffff9add653a>] local_pci_probe+0x4a/0xb0
>> [    6.531760]  [<ffffffff9add7c89>] pci_device_probe+0x109/0x160
>> [    6.531762]  [<ffffffff9aebb0e5>] driver_probe_device+0xc5/0x3e0
>> [    6.531764]  [<ffffffff9aebb4e3>] __driver_attach+0x93/0xa0
>> [    6.531765]  [<ffffffff9aebb450>] ? __device_attach+0x50/0x50
>> [    6.531766]  [<ffffffff9aeb8c85>] bus_for_each_dev+0x75/0xc0
>> [    6.531767]  [<ffffffff9aebaa5e>] driver_attach+0x1e/0x20
>> [    6.531769]  [<ffffffff9aeba500>] bus_add_driver+0x200/0x2d0
>> [    6.531770]  [<ffffffff9aebbb74>] driver_register+0x64/0xf0
>> [    6.531771]  [<ffffffff9add74c5>] __pci_register_driver+0xa5/0xc0
>> [    6.531774]  [<ffffffffc0bd5000>] ? 0xffffffffc0bd4fff
>> [    6.531806]  [<ffffffffc0bd50a4>] amdgpu_init+0xa4/0xb0 [amdgpu]
>> [    6.531809]  [<ffffffff9aa0210a>] do_one_initcall+0xba/0x240
>> [    6.531812]  [<ffffffff9ab1e45a>] load_module+0x271a/0x2bb0
>> [    6.531815]  [<ffffffff9adb41c0>] ? ddebug_proc_write+0x100/0x100
>> [    6.531817]  [<ffffffff9ab1e9df>] SyS_init_module+0xef/0x140
>> [    6.531821]  [<ffffffff9b193f92>] system_call_fastpath+0x25/0x2a
>> [    6.531822] ---[ end trace e2d035c822a91de6 ]---
>>
>> Signed-off-by: Guchun Chen <guchun.chen at amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 87 +++++++++----------------
>>    1 file changed, 31 insertions(+), 56 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index e5ea14774c0c..6c57521b21fe 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -1027,58 +1027,6 @@ static ssize_t amdgpu_ras_sysfs_features_read(struct device *dev,
>>    	return scnprintf(buf, PAGE_SIZE, "feature mask: 0x%x\n", con->features);
>>    }
>>    
>> -static void amdgpu_ras_sysfs_add_bad_page_node(struct amdgpu_device
>> *adev) -{
>> -	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>> -	struct attribute_group group;
>> -	struct bin_attribute *bin_attrs[] = {
>> -		&con->badpages_attr,
>> -		NULL,
>> -	};
>> -
>> -	con->badpages_attr = (struct bin_attribute) {
>> -		.attr = {
>> -			.name = "gpu_vram_bad_pages",
>> -			.mode = S_IRUGO,
>> -		},
>> -		.size = 0,
>> -		.private = NULL,
>> -		.read = amdgpu_ras_sysfs_badpages_read,
>> -	};
>> -
>> -	group.name = RAS_FS_NAME;
>> -	group.bin_attrs = bin_attrs;
>> -
>> -	sysfs_bin_attr_init(bin_attrs[0]);
>> -
>> -	sysfs_update_group(&adev->dev->kobj, &group);
>> -}
>> -
>> -static int amdgpu_ras_sysfs_create_feature_node(struct amdgpu_device
>> *adev) -{
>> -	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>> -	struct attribute *attrs[] = {
>> -		&con->features_attr.attr,
>> -		NULL
>> -	};
>> -	struct attribute_group group = {
>> -		.name = RAS_FS_NAME,
>> -		.attrs = attrs,
>> -	};
>> -
>> -	con->features_attr = (struct device_attribute) {
>> -		.attr = {
>> -			.name = "features",
>> -			.mode = S_IRUGO,
>> -		},
>> -			.show = amdgpu_ras_sysfs_features_read,
>> -	};
>> -
>> -	sysfs_attr_init(attrs[0]);
>> -
>> -	return sysfs_create_group(&adev->dev->kobj, &group);
>> -}
>> -
>>    static void amdgpu_ras_sysfs_remove_bad_page_node(struct amdgpu_device *adev)
>>    {
>>    	struct amdgpu_ras *con = amdgpu_ras_get_context(adev); @@ 
>> -1300,13
>> +1248,40 @@ static void amdgpu_ras_debugfs_remove_all(struct 
>> +amdgpu_device *adev)
>>    /* debugfs end */
>>    
>>    /* ras fs */
>> -
>> +static BIN_ATTR(gpu_vram_bad_pages, S_IRUGO,
>> +		amdgpu_ras_sysfs_badpages_read, NULL, 0); static 
>> +DEVICE_ATTR(features, S_IRUGO,
>> +		amdgpu_ras_sysfs_features_read, NULL);
>>    static int amdgpu_ras_fs_init(struct amdgpu_device *adev)
>>    {
>> -	amdgpu_ras_sysfs_create_feature_node(adev);
>> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>> +	struct attribute_group group = {
>> +		.name = RAS_FS_NAME,
>> +	};
>> +	struct attribute *attrs[] = {
>> +		&con->features_attr.attr,
>> +		NULL
>> +	};
>> +	struct bin_attribute *bin_attrs[] = {
>> +		NULL,
>> +		NULL,
>> +	};
>>    
>> -	if (amdgpu_bad_page_threshold != 0)
>> -		amdgpu_ras_sysfs_add_bad_page_node(adev);
>> +	/* add features entry */
>> +	con->features_attr = dev_attr_features;
>> +	group.attrs = attrs;
>> +	sysfs_attr_init(attrs[0]);
>> +
>> +	if (amdgpu_bad_page_threshold != 0) {
>> +		/* add bad_page_features entry */
>> +		bin_attr_gpu_vram_bad_pages.private = NULL;
>> +		con->badpages_attr = bin_attr_gpu_vram_bad_pages;
>> +		bin_attrs[0] = &con->badpages_attr;
>> +		group.bin_attrs = bin_attrs;
>> +		sysfs_bin_attr_init(bin_attrs[0]);
>> +	}
>> +
>> +	sysfs_create_group(&adev->dev->kobj, &group);
>>    
>>    	return 0;
>>    }
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=02%7C01%7CGu
> chun.Chen%40amd.com%7Cb7a499a9d8704a4371d508d8608e1e75%7C3dd8961fe4884
> e608e11a82d994e183d%7C0%7C0%7C637365510798804843&sdata=nIEyvr%2Fnp
> kgSRzf%2FGqdYi39pl8rGnboJzbx8Wp9kWbo%3D&reserved=0


More information about the amd-gfx mailing list