[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak
Felix Kuehling
felix.kuehling at amd.com
Fri Aug 12 22:11:43 UTC 2022
On 2022-08-12 18:05, Andrey Grodzovsky wrote:
>
> On 2022-08-12 14:38, Kim, Jonathan wrote:
>> [Public]
>>
>> Hi Andrey,
>>
>> Here's the load/unload stack trace. This is a 2 GPU xGMI system. I
>> added dbg_xgmi_hive_get/put refcount prints after the kobj get/put.
>> The count is stuck at 2 on unload. On an 8 GPU system, it's stuck at 8.
>>
>> e.g. of sysfs leak after driver unload:
>> atitest@atitest:/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:00.0/0000:83:00.0$
>> ls xgmi_hive_info/
>> xgmi_hive_id
>>
>> Thanks,
>>
>> Jon
>
>
> I see the leak, but how is it related to amdgpu_reset_domain? How do
> you think it is causing this?
Does YiPeng's patch "[PATCH 2/2] drm/amdgpu: fix hive reference leak
when adding xgmi device" address the same issue?
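
For reference, the get/put sequence in Jon's trace can be replayed in a
small userspace model. This is only a sketch: the names replay_trace and
leaked_refs are hypothetical, the starting count of 1 is inferred from the
first print showing ref_count 2, and the per-GPU formula merely generalises
the pattern visible in the log (two gets per device on load, one put per
device on unload, plus one extra put when the last device is removed). It
does not call any amdgpu code.

```c
#include <stddef.h>

/* Hypothetical model of the hive refcount; not driver code. */
enum ref_event { REF_GET = 1, REF_PUT = -1 };

/* Replay the dbg_xgmi_hive_get/put events in the order they were logged. */
static int replay_trace(void)
{
	static const enum ref_event events[] = {
		REF_GET, REF_GET,   /* 83:00.0 load: add_device, device_init   -> 2, 3 */
		REF_GET, REF_GET,   /* 86:00.0 load: add_device, device_init   -> 4, 5 */
		REF_GET, REF_PUT,   /* amdgpu_xgmi_set_pstate, balanced        -> 6, 5 */
		REF_GET, REF_PUT,   /* amdgpu_xgmi_set_pstate, balanced        -> 6, 5 */
		REF_PUT,            /* 86:00.0 unload: remove_device           -> 4 */
		REF_PUT, REF_PUT,   /* 83:00.0 unload: remove_device, two puts -> 3, 2 */
	};
	int ref = 1; /* assumption: count is 1 before the first logged get */

	for (size_t i = 0; i < sizeof(events) / sizeof(events[0]); i++)
		ref += events[i];
	return ref;
}

/* Generalise the same pattern to num_gpus devices in one hive. */
static int leaked_refs(int num_gpus)
{
	int ref = 1;          /* assumed initial count, as above */

	ref += 2 * num_gpus;  /* load: two gets per device */
	ref -= num_gpus;      /* unload: one put per device */
	ref -= 1;             /* extra put when the last device is removed */
	return ref;
}
```

Under these assumptions replay_trace() ends at 2, and leaked_refs(2) and
leaked_refs(8) give 2 and 8, which matches the counts Jon reported for the
2 GPU and 8 GPU systems.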
Regards,
Felix
>
> Andrey
>
>
>>
>>
>> Driver load (get ref happens on both device add to hive and init per
>> device):
>> [ 61.975900] amdkcl: loading out-of-tree module taints kernel.
>> [ 61.975973] amdkcl: module verification failed: signature and/or
>> required key missing - tainting kernel
>> [ 62.065546] amdkcl: Warning: fail to get symbol cancel_work,
>> replace it with kcl stub
>> [ 62.081920] AMD-Vi: AMD IOMMUv2 functionality not available on
>> this system - This is not a bug.
>> [ 62.491119] [drm] amdgpu kernel modesetting enabled.
>> [ 62.491122] [drm] amdgpu version: 5.18.2
>> [ 62.491124] [drm] OS DRM version: 5.15.0
>> [ 62.491337] amdgpu: CRAT table not found
>> [ 62.491341] amdgpu: Virtual CRAT table created for CPU
>> [ 62.491360] amdgpu: Topology: Add CPU node
>> [ 62.603556] amdgpu: PeerDirect support was initialized successfully
>> [ 62.603847] amdgpu 0000:83:00.0: enabling device (0100 -> 0102)
>> [ 62.603987] [drm] initializing kernel modesetting (VEGA20
>> 0x1002:0x66A1 0x1002:0x0834 0x00).
>> [ 62.604023] [drm] register mmio base: 0xFBD00000
>> [ 62.604026] [drm] register mmio size: 524288
>> [ 62.604171] [drm] add ip block number 0 <soc15_common>
>> [ 62.604175] [drm] add ip block number 1 <gmc_v9_0>
>> [ 62.604177] [drm] add ip block number 2 <vega20_ih>
>> [ 62.604180] [drm] add ip block number 3 <psp>
>> [ 62.604182] [drm] add ip block number 4 <powerplay>
>> [ 62.604185] [drm] add ip block number 5 <dm>
>> [ 62.604187] [drm] add ip block number 6 <gfx_v9_0>
>> [ 62.604190] [drm] add ip block number 7 <sdma_v4_0>
>> [ 62.604192] [drm] add ip block number 8 <uvd_v7_0>
>> [ 62.604194] [drm] add ip block number 9 <vce_v4_0>
>> [ 62.641771] amdgpu 0000:83:00.0: amdgpu: Fetched VBIOS from ROM BAR
>> [ 62.641777] amdgpu: ATOM BIOS: 113-D1630200-112
>> [ 62.713418] [drm] UVD(0) is enabled in VM mode
>> [ 62.713423] [drm] UVD(1) is enabled in VM mode
>> [ 62.713426] [drm] UVD(0) ENC is enabled in VM mode
>> [ 62.713428] [drm] UVD(1) ENC is enabled in VM mode
>> [ 62.713430] [drm] VCE enabled in VM mode
>> [ 62.713433] amdgpu 0000:83:00.0: amdgpu: Trusted Memory Zone (TMZ)
>> feature not supported
>> [ 62.713472] [drm] GPU posting now...
>> [ 62.713993] amdgpu 0000:83:00.0: amdgpu: MEM ECC is active.
>> [ 62.713995] amdgpu 0000:83:00.0: amdgpu: SRAM ECC is active.
>> [ 62.714006] amdgpu 0000:83:00.0: amdgpu: RAS INFO: ras initialized
>> successfully, hardware ability[7fff] ras_mask[7fff]
>> [ 62.714018] [drm] vm size is 262144 GB, 4 levels, block size is
>> 9-bit, fragment size is 9-bit
>> [ 62.714026] amdgpu 0000:83:00.0: amdgpu: VRAM: 32752M
>> 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
>> [ 62.714029] amdgpu 0000:83:00.0: amdgpu: GART: 512M
>> 0x0000000000000000 - 0x000000001FFFFFFF
>> [ 62.714032] amdgpu 0000:83:00.0: amdgpu: AGP: 267845632M
>> 0x0000009000000000 - 0x0000FFFFFFFFFFFF
>> [ 62.714043] [drm] Detected VRAM RAM=32752M, BAR=32768M
>> [ 62.714044] [drm] RAM width 4096bits HBM
>> [ 62.714050] debugfs: Directory 'ttm' with parent '/' already present!
>> [ 62.714146] [drm] amdgpu: 32752M of VRAM memory ready
>> [ 62.714149] [drm] amdgpu: 40203M of GTT memory ready.
>> [ 62.714170] [drm] GART: num cpu pages 131072, num gpu pages 131072
>> [ 62.714266] [drm] PCIE GART of 512M enabled.
>> [ 62.714267] [drm] PTB located at 0x0000008000000000
>> [ 62.731067] amdgpu 0000:83:00.0: amdgpu: PSP runtime database
>> doesn't exist
>> [ 62.731075] amdgpu 0000:83:00.0: amdgpu: PSP runtime database
>> doesn't exist
>> [ 62.731449] amdgpu: [powerplay] hwmgr_sw_init smu backed is
>> vega20_smu
>> [ 62.743177] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
>> [ 62.743244] [drm] PSP loading UVD firmware
>> [ 62.744525] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
>> [ 62.744689] [drm] PSP loading VCE firmware
>> [ 62.896804] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
>> [ 62.979421] amdgpu 0000:83:00.0: amdgpu: HDCP: optional hdcp ta
>> ucode is not available
>> [ 62.979427] amdgpu 0000:83:00.0: amdgpu: DTM: optional dtm ta
>> ucode is not available
>> [ 62.979430] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta
>> ucode is not available
>> [ 62.979432] amdgpu 0000:83:00.0: amdgpu: SECUREDISPLAY:
>> securedisplay ta ucode is not available
>> [ 62.982386] [drm] Display Core initialized with v3.2.196!
>> [ 62.984514] [drm] kiq ring mec 2 pipe 1 q 0
>> [ 63.026846] [drm] UVD and UVD ENC initialized successfully.
>> [ 63.225760] [drm] VCE initialized successfully.
>> [ 63.244442] amdgpu: [dbg_xgmi_hive_get] ref_count 2
>> [ 63.244448] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 63.244454] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 63.244457] Workqueue: events work_for_cpu_fn
>> [ 63.244471] Call Trace:
>> [ 63.244474] <TASK>
>> [ 63.244479] dump_stack_lvl+0x4a/0x63
>> [ 63.244493] dump_stack+0x10/0x16
>> [ 63.244501] amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
>> [ 63.245047] amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
>> [ 63.245463] ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
>> [ 63.245879] ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
>> [ 63.246466] amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
>> [ 63.247055] ? pci_bus_read_config_word+0x4a/0x70
>> [ 63.247064] ? do_pci_enable_device+0xdb/0x110
>> [ 63.247070] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 63.247463] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 63.247868] local_pci_probe+0x4b/0x90
>> [ 63.247876] work_for_cpu_fn+0x1a/0x30
>> [ 63.247881] process_one_work+0x22b/0x3d0
>> [ 63.247887] worker_thread+0x21d/0x3f0
>> [ 63.247893] ? process_one_work+0x3d0/0x3d0
>> [ 63.247898] kthread+0x12a/0x150
>> [ 63.247905] ? set_kthread_struct+0x50/0x50
>> [ 63.247910] ret_from_fork+0x22/0x30
>> [ 63.247922] </TASK>
>> [ 63.248563] amdgpu 0000:83:00.0: amdgpu: XGMI: Add node 0, hive
>> 0x25bbae7e3fd04cf4.
>> [ 63.248569] amdgpu: [dbg_xgmi_hive_get] ref_count 3
>> [ 63.248572] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 63.248578] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 63.248580] Workqueue: events work_for_cpu_fn
>> [ 63.248587] Call Trace:
>> [ 63.248588] <TASK>
>> [ 63.248590] dump_stack_lvl+0x4a/0x63
>> [ 63.248598] dump_stack+0x10/0x16
>> [ 63.248604] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
>> [ 63.249033] amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
>> [ 63.249621] ? pci_bus_read_config_word+0x4a/0x70
>> [ 63.249627] ? do_pci_enable_device+0xdb/0x110
>> [ 63.249632] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 63.250022] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 63.250410] local_pci_probe+0x4b/0x90
>> [ 63.250416] work_for_cpu_fn+0x1a/0x30
>> [ 63.250421] process_one_work+0x22b/0x3d0
>> [ 63.250428] worker_thread+0x21d/0x3f0
>> [ 63.250434] ? process_one_work+0x3d0/0x3d0
>> [ 63.250440] kthread+0x12a/0x150
>> [ 63.250445] ? set_kthread_struct+0x50/0x50
>> [ 63.250450] ret_from_fork+0x22/0x30
>> [ 63.250458] </TASK>
>> [ 63.268869] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>> [ 63.269180] amdgpu: sdma_bitmap: ffff
>> [ 63.605188] memmap_init_zone_device initialised 8388608 pages in
>> 132ms
>> [ 63.605203] amdgpu: HMM registered 32752MB device memory
>> [ 63.605244] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>>
>> [ 63.605263] amdgpu: Virtual CRAT table created for GPU
>> [ 63.605651] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>>
>> [ 63.605659] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
>> [ 63.605670] kfd kfd: amdgpu: added device 1002:66a1
>> [ 63.626300] amdgpu 0000:83:00.0: amdgpu: SE 4, SH per SE 1, CU per
>> SH 16, active_cu_number 64
>> [ 63.626517] amdgpu 0000:83:00.0: amdgpu: ring gfx uses VM inv eng
>> 0 on hub 0
>> [ 63.626522] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.0 uses VM
>> inv eng 1 on hub 0
>> [ 63.626525] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.0 uses VM
>> inv eng 4 on hub 0
>> [ 63.626529] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.0 uses VM
>> inv eng 5 on hub 0
>> [ 63.626531] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.0 uses VM
>> inv eng 6 on hub 0
>> [ 63.626534] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.1 uses VM
>> inv eng 7 on hub 0
>> [ 63.626537] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.1 uses VM
>> inv eng 8 on hub 0
>> [ 63.626540] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.1 uses VM
>> inv eng 9 on hub 0
>> [ 63.626543] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.1 uses VM
>> inv eng 10 on hub 0
>> [ 63.626546] amdgpu 0000:83:00.0: amdgpu: ring kiq_2.1.0 uses VM
>> inv eng 11 on hub 0
>> [ 63.626549] amdgpu 0000:83:00.0: amdgpu: ring sdma0 uses VM inv
>> eng 0 on hub 1
>> [ 63.626552] amdgpu 0000:83:00.0: amdgpu: ring page0 uses VM inv
>> eng 1 on hub 1
>> [ 63.626555] amdgpu 0000:83:00.0: amdgpu: ring sdma1 uses VM inv
>> eng 4 on hub 1
>> [ 63.626558] amdgpu 0000:83:00.0: amdgpu: ring page1 uses VM inv
>> eng 5 on hub 1
>> [ 63.626561] amdgpu 0000:83:00.0: amdgpu: ring uvd_0 uses VM inv
>> eng 6 on hub 1
>> [ 63.626563] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.0 uses VM
>> inv eng 7 on hub 1
>> [ 63.626566] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.1 uses VM
>> inv eng 8 on hub 1
>> [ 63.626569] amdgpu 0000:83:00.0: amdgpu: ring uvd_1 uses VM inv
>> eng 9 on hub 1
>> [ 63.626572] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.0 uses VM
>> inv eng 10 on hub 1
>> [ 63.626575] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.1 uses VM
>> inv eng 11 on hub 1
>> [ 63.626577] amdgpu 0000:83:00.0: amdgpu: ring vce0 uses VM inv eng
>> 12 on hub 1
>> [ 63.626580] amdgpu 0000:83:00.0: amdgpu: ring vce1 uses VM inv eng
>> 13 on hub 1
>> [ 63.626583] amdgpu 0000:83:00.0: amdgpu: ring vce2 uses VM inv eng
>> 14 on hub 1
>> [ 63.636996] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
>> [ 63.637046] amdgpu: Detected AMDGPU 2 Perf Events.
>> [ 63.637428] [drm] Initialized amdgpu 3.48.0 20150101 for
>> 0000:83:00.0 on minor 1
>> [ 63.637937] amdgpu 0000:86:00.0: enabling device (0100 -> 0102)
>> [ 63.638043] [drm] initializing kernel modesetting (VEGA20
>> 0x1002:0x66A1 0x1002:0x0834 0x00).
>> [ 63.638090] [drm] register mmio base: 0xFBB00000
>> [ 63.638092] [drm] register mmio size: 524288
>> [ 63.638261] [drm] add ip block number 0 <soc15_common>
>> [ 63.638263] [drm] add ip block number 1 <gmc_v9_0>
>> [ 63.638265] [drm] add ip block number 2 <vega20_ih>
>> [ 63.638266] [drm] add ip block number 3 <psp>
>> [ 63.638267] [drm] add ip block number 4 <powerplay>
>> [ 63.638269] [drm] add ip block number 5 <dm>
>> [ 63.638271] [drm] add ip block number 6 <gfx_v9_0>
>> [ 63.638272] [drm] add ip block number 7 <sdma_v4_0>
>> [ 63.638273] [drm] add ip block number 8 <uvd_v7_0>
>> [ 63.638275] [drm] add ip block number 9 <vce_v4_0>
>> [ 63.675838] amdgpu 0000:86:00.0: amdgpu: Fetched VBIOS from ROM BAR
>> [ 63.675842] amdgpu: ATOM BIOS: 113-D1630200-112
>> [ 63.675867] [drm] UVD(0) is enabled in VM mode
>> [ 63.675868] [drm] UVD(1) is enabled in VM mode
>> [ 63.675869] [drm] UVD(0) ENC is enabled in VM mode
>> [ 63.675870] [drm] UVD(1) ENC is enabled in VM mode
>> [ 63.675871] [drm] VCE enabled in VM mode
>> [ 63.675873] amdgpu 0000:86:00.0: amdgpu: Trusted Memory Zone (TMZ)
>> feature not supported
>> [ 63.675899] [drm] GPU posting now...
>> [ 63.676276] amdgpu 0000:86:00.0: amdgpu: MEM ECC is active.
>> [ 63.676277] amdgpu 0000:86:00.0: amdgpu: SRAM ECC is active.
>> [ 63.676286] amdgpu 0000:86:00.0: amdgpu: RAS INFO: ras initialized
>> successfully, hardware ability[7fff] ras_mask[7fff]
>> [ 63.676297] [drm] vm size is 262144 GB, 4 levels, block size is
>> 9-bit, fragment size is 9-bit
>> [ 63.676304] amdgpu 0000:86:00.0: amdgpu: VRAM: 32752M
>> 0x0000008800000000 - 0x0000008FFEFFFFFF (32752M used)
>> [ 63.676307] amdgpu 0000:86:00.0: amdgpu: GART: 512M
>> 0x0000000000000000 - 0x000000001FFFFFFF
>> [ 63.676310] amdgpu 0000:86:00.0: amdgpu: AGP: 267845632M
>> 0x0000009000000000 - 0x0000FFFFFFFFFFFF
>> [ 63.676321] [drm] Detected VRAM RAM=32752M, BAR=32768M
>> [ 63.676322] [drm] RAM width 4096bits HBM
>> [ 63.676363] [drm] amdgpu: 32752M of VRAM memory ready
>> [ 63.676365] [drm] amdgpu: 40203M of GTT memory ready.
>> [ 63.676388] [drm] GART: num cpu pages 131072, num gpu pages 131072
>> [ 63.676481] [drm] PCIE GART of 512M enabled.
>> [ 63.676482] [drm] PTB located at 0x0000008800000000
>> [ 63.676730] amdgpu 0000:86:00.0: amdgpu: PSP runtime database
>> doesn't exist
>> [ 63.676733] amdgpu 0000:86:00.0: amdgpu: PSP runtime database
>> doesn't exist
>> [ 63.677088] amdgpu: [powerplay] hwmgr_sw_init smu backed is
>> vega20_smu
>> [ 63.678862] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
>> [ 63.678918] [drm] PSP loading UVD firmware
>> [ 63.679487] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
>> [ 63.679619] [drm] PSP loading VCE firmware
>> [ 63.831730] [drm] reserve 0x400000 from 0x8ffec00000 for PSP TMR
>> [ 63.914508] amdgpu 0000:86:00.0: amdgpu: HDCP: optional hdcp ta
>> ucode is not available
>> [ 63.914513] amdgpu 0000:86:00.0: amdgpu: DTM: optional dtm ta
>> ucode is not available
>> [ 63.914516] amdgpu 0000:86:00.0: amdgpu: RAP: optional rap ta
>> ucode is not available
>> [ 63.914518] amdgpu 0000:86:00.0: amdgpu: SECUREDISPLAY:
>> securedisplay ta ucode is not available
>> [ 63.917458] [drm] Display Core initialized with v3.2.196!
>> [ 63.919616] [drm] kiq ring mec 2 pipe 1 q 0
>> [ 63.961950] [drm] UVD and UVD ENC initialized successfully.
>> [ 64.160863] [drm] VCE initialized successfully.
>> [ 64.179285] amdgpu: [dbg_xgmi_hive_get] ref_count 4
>> [ 64.179291] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 64.179297] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 64.179299] Workqueue: events work_for_cpu_fn
>> [ 64.179311] Call Trace:
>> [ 64.179315] <TASK>
>> [ 64.179320] dump_stack_lvl+0x4a/0x63
>> [ 64.179331] dump_stack+0x10/0x16
>> [ 64.179340] amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
>> [ 64.179904] amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
>> [ 64.180318] ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
>> [ 64.180733] ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
>> [ 64.181321] amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
>> [ 64.181909] ? pci_bus_read_config_word+0x4a/0x70
>> [ 64.181917] ? do_pci_enable_device+0xdb/0x110
>> [ 64.181923] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 64.182315] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 64.182703] local_pci_probe+0x4b/0x90
>> [ 64.182710] work_for_cpu_fn+0x1a/0x30
>> [ 64.182715] process_one_work+0x22b/0x3d0
>> [ 64.182722] worker_thread+0x21d/0x3f0
>> [ 64.182728] ? process_one_work+0x3d0/0x3d0
>> [ 64.182734] kthread+0x12a/0x150
>> [ 64.182740] ? set_kthread_struct+0x50/0x50
>> [ 64.182745] ret_from_fork+0x22/0x30
>> [ 64.182756] </TASK>
>> [ 64.184561] amdgpu 0000:86:00.0: amdgpu: XGMI: Add node 1, hive
>> 0x25bbae7e3fd04cf4.
>> [ 64.184568] amdgpu: [dbg_xgmi_hive_get] ref_count 5
>> [ 64.184571] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 64.184576] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 64.184578] Workqueue: events work_for_cpu_fn
>> [ 64.184585] Call Trace:
>> [ 64.184587] <TASK>
>> [ 64.184589] dump_stack_lvl+0x4a/0x63
>> [ 64.184596] dump_stack+0x10/0x16
>> [ 64.184602] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
>> [ 64.185041] amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
>> [ 64.185624] ? pci_bus_read_config_word+0x4a/0x70
>> [ 64.185631] ? do_pci_enable_device+0xdb/0x110
>> [ 64.185636] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 64.186027] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 64.186416] local_pci_probe+0x4b/0x90
>> [ 64.186422] work_for_cpu_fn+0x1a/0x30
>> [ 64.186428] process_one_work+0x22b/0x3d0
>> [ 64.186434] worker_thread+0x21d/0x3f0
>> [ 64.186439] ? process_one_work+0x3d0/0x3d0
>> [ 64.186445] kthread+0x12a/0x150
>> [ 64.186450] ? set_kthread_struct+0x50/0x50
>> [ 64.186455] ret_from_fork+0x22/0x30
>> [ 64.186464] </TASK>
>> [ 64.206119] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
>> [ 64.206433] amdgpu: sdma_bitmap: ffff
>> [ 64.552064] memmap_init_zone_device initialised 8388608 pages in
>> 132ms
>> [ 64.552080] amdgpu: HMM registered 32752MB device memory
>> [ 64.552116] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>>
>> [ 64.552138] amdgpu: Virtual CRAT table created for GPU
>> [ 64.552978] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>>
>> [ 64.552988] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
>> [ 64.552999] kfd kfd: amdgpu: added device 1002:66a1
>> [ 64.570314] amdgpu 0000:86:00.0: amdgpu: SE 4, SH per SE 1, CU per
>> SH 16, active_cu_number 64
>> [ 64.570527] amdgpu 0000:86:00.0: amdgpu: ring gfx uses VM inv eng
>> 0 on hub 0
>> [ 64.570531] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.0 uses VM
>> inv eng 1 on hub 0
>> [ 64.570535] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.0 uses VM
>> inv eng 4 on hub 0
>> [ 64.570538] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.0 uses VM
>> inv eng 5 on hub 0
>> [ 64.570541] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.0 uses VM
>> inv eng 6 on hub 0
>> [ 64.570544] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.1 uses VM
>> inv eng 7 on hub 0
>> [ 64.570547] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.1 uses VM
>> inv eng 8 on hub 0
>> [ 64.570550] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.1 uses VM
>> inv eng 9 on hub 0
>> [ 64.570552] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.1 uses VM
>> inv eng 10 on hub 0
>> [ 64.570556] amdgpu 0000:86:00.0: amdgpu: ring kiq_2.1.0 uses VM
>> inv eng 11 on hub 0
>> [ 64.570559] amdgpu 0000:86:00.0: amdgpu: ring sdma0 uses VM inv
>> eng 0 on hub 1
>> [ 64.570562] amdgpu 0000:86:00.0: amdgpu: ring page0 uses VM inv
>> eng 1 on hub 1
>> [ 64.570565] amdgpu 0000:86:00.0: amdgpu: ring sdma1 uses VM inv
>> eng 4 on hub 1
>> [ 64.570567] amdgpu 0000:86:00.0: amdgpu: ring page1 uses VM inv
>> eng 5 on hub 1
>> [ 64.570570] amdgpu 0000:86:00.0: amdgpu: ring uvd_0 uses VM inv
>> eng 6 on hub 1
>> [ 64.570573] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.0 uses VM
>> inv eng 7 on hub 1
>> [ 64.570576] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.1 uses VM
>> inv eng 8 on hub 1
>> [ 64.570579] amdgpu 0000:86:00.0: amdgpu: ring uvd_1 uses VM inv
>> eng 9 on hub 1
>> [ 64.570581] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.0 uses VM
>> inv eng 10 on hub 1
>> [ 64.570584] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.1 uses VM
>> inv eng 11 on hub 1
>> [ 64.570587] amdgpu 0000:86:00.0: amdgpu: ring vce0 uses VM inv eng
>> 12 on hub 1
>> [ 64.570589] amdgpu 0000:86:00.0: amdgpu: ring vce1 uses VM inv eng
>> 13 on hub 1
>> [ 64.570592] amdgpu 0000:86:00.0: amdgpu: ring vce2 uses VM inv eng
>> 14 on hub 1
>> [ 64.581070] amdgpu: [dbg_xgmi_hive_get] ref_count 6
>> [ 64.581075] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 64.581079] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 64.581081] Workqueue: events work_for_cpu_fn
>> [ 64.581089] Call Trace:
>> [ 64.581091] <TASK>
>> [ 64.581094] dump_stack_lvl+0x4a/0x63
>> [ 64.581103] dump_stack+0x10/0x16
>> [ 64.581109] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
>> [ 64.581489] amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
>> [ 64.581723] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
>> [ 64.581943] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
>> [ 64.582288] ? pci_bus_read_config_word+0x4a/0x70
>> [ 64.582295] ? do_pci_enable_device+0xdb/0x110
>> [ 64.582298] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 64.582520] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 64.582738] local_pci_probe+0x4b/0x90
>> [ 64.582743] work_for_cpu_fn+0x1a/0x30
>> [ 64.582746] process_one_work+0x22b/0x3d0
>> [ 64.582750] worker_thread+0x21d/0x3f0
>> [ 64.582753] ? process_one_work+0x3d0/0x3d0
>> [ 64.582756] kthread+0x12a/0x150
>> [ 64.582761] ? set_kthread_struct+0x50/0x50
>> [ 64.582764] ret_from_fork+0x22/0x30
>> [ 64.582772] </TASK>
>> [ 64.582774] amdgpu: [dbg_xgmi_hive_put] ref_count 5
>> [ 64.582775] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 64.582778] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 64.582779] Workqueue: events work_for_cpu_fn
>> [ 64.582782] Call Trace:
>> [ 64.582783] <TASK>
>> [ 64.582784] dump_stack_lvl+0x4a/0x63
>> [ 64.582789] dump_stack+0x10/0x16
>> [ 64.582792] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
>> [ 64.583028] amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
>> [ 64.583262] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
>> [ 64.583482] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
>> [ 64.583833] ? pci_bus_read_config_word+0x4a/0x70
>> [ 64.583836] ? do_pci_enable_device+0xdb/0x110
>> [ 64.583840] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 64.584072] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 64.584304] local_pci_probe+0x4b/0x90
>> [ 64.584307] work_for_cpu_fn+0x1a/0x30
>> [ 64.584311] process_one_work+0x22b/0x3d0
>> [ 64.584314] worker_thread+0x21d/0x3f0
>> [ 64.584318] ? process_one_work+0x3d0/0x3d0
>> [ 64.584321] kthread+0x12a/0x150
>> [ 64.584324] ? set_kthread_struct+0x50/0x50
>> [ 64.584327] ret_from_fork+0x22/0x30
>> [ 64.584333] </TASK>
>> [ 64.584342] amdgpu: [dbg_xgmi_hive_get] ref_count 6
>> [ 64.584344] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 64.584347] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 64.584348] Workqueue: events work_for_cpu_fn
>> [ 64.584352] Call Trace:
>> [ 64.584353] <TASK>
>> [ 64.584354] dump_stack_lvl+0x4a/0x63
>> [ 64.584358] dump_stack+0x10/0x16
>> [ 64.584362] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
>> [ 64.584610] amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
>> [ 64.584856] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
>> [ 64.585086] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
>> [ 64.585437] ? pci_bus_read_config_word+0x4a/0x70
>> [ 64.585440] ? do_pci_enable_device+0xdb/0x110
>> [ 64.585443] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 64.585679] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 64.585922] local_pci_probe+0x4b/0x90
>> [ 64.585926] work_for_cpu_fn+0x1a/0x30
>> [ 64.585929] process_one_work+0x22b/0x3d0
>> [ 64.585932] worker_thread+0x21d/0x3f0
>> [ 64.585936] ? process_one_work+0x3d0/0x3d0
>> [ 64.585939] kthread+0x12a/0x150
>> [ 64.585942] ? set_kthread_struct+0x50/0x50
>> [ 64.585945] ret_from_fork+0x22/0x30
>> [ 64.585950] </TASK>
>> [ 64.585951] amdgpu: [dbg_xgmi_hive_put] ref_count 5
>> [ 64.585953] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
>> G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 64.585956] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 64.585957] Workqueue: events work_for_cpu_fn
>> [ 64.585960] Call Trace:
>> [ 64.585961] <TASK>
>> [ 64.585963] dump_stack_lvl+0x4a/0x63
>> [ 64.585967] dump_stack+0x10/0x16
>> [ 64.585970] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
>> [ 64.586213] amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
>> [ 64.586458] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
>> [ 64.586688] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
>> [ 64.587037] ? pci_bus_read_config_word+0x4a/0x70
>> [ 64.587040] ? do_pci_enable_device+0xdb/0x110
>> [ 64.587043] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
>> [ 64.587277] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
>> [ 64.587509] local_pci_probe+0x4b/0x90
>> [ 64.587512] work_for_cpu_fn+0x1a/0x30
>> [ 64.587515] process_one_work+0x22b/0x3d0
>> [ 64.587519] worker_thread+0x21d/0x3f0
>> [ 64.587523] ? process_one_work+0x3d0/0x3d0
>> [ 64.587526] kthread+0x12a/0x150
>> [ 64.587529] ? set_kthread_struct+0x50/0x50
>> [ 64.587532] ret_from_fork+0x22/0x30
>> [ 64.587537] </TASK>
>> [ 64.587619] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
>> [ 64.587663] amdgpu: Detected AMDGPU 2 Perf Events.
>> [ 64.588081] [drm] Initialized amdgpu 3.48.0 20150101 for
>> 0000:86:00.0 on minor 2
>>
>> Then driver unload (reference stuck at 2):
>> [ 110.117018] amdgpu 0000:86:00.0: amdgpu: amdgpu: finishing device.
>> [ 110.131638] [drm] free PSP TMR buffer
>> [ 110.420529] amdgpu: [dbg_xgmi_hive_put] ref_count 4
>> [ 110.420537] CPU: 27 PID: 1748 Comm: modprobe Tainted: G
>> OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 110.420545] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 110.420548] Call Trace:
>> [ 110.420551] <TASK>
>> [ 110.420556] dump_stack_lvl+0x4a/0x63
>> [ 110.420569] dump_stack+0x10/0x16
>> [ 110.420578] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
>> [ 110.421001] amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
>> [ 110.421380] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
>> [ 110.421724] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>> [ 110.422070] drm_dev_release+0x28/0x50 [drm]
>> [ 110.422145] devm_drm_dev_init_release+0x38/0x60 [drm]
>> [ 110.422190] devm_action_release+0x15/0x20
>> [ 110.422198] release_nodes+0x40/0xb0
>> [ 110.422205] devres_release_all+0x9e/0xe0
>> [ 110.422212] device_release_driver_internal+0x117/0x1f0
>> [ 110.422218] driver_detach+0x4c/0xa0
>> [ 110.422222] bus_remove_driver+0x6c/0xf0
>> [ 110.422227] driver_unregister+0x31/0x50
>> [ 110.422231] pci_unregister_driver+0x40/0x90
>> [ 110.422238] amdgpu_exit+0x15/0x446 [amdgpu]
>> [ 110.422791] __x64_sys_delete_module+0x14e/0x260
>> [ 110.422801] ? do_syscall_64+0x69/0xc0
>> [ 110.422809] ? __x64_sys_read+0x1a/0x20
>> [ 110.422817] ? do_syscall_64+0x69/0xc0
>> [ 110.422821] ? ksys_read+0x67/0xf0
>> [ 110.422825] do_syscall_64+0x5c/0xc0
>> [ 110.422830] ? __x64_sys_read+0x1a/0x20
>> [ 110.422834] ? do_syscall_64+0x69/0xc0
>> [ 110.422839] ? syscall_exit_to_user_mode+0x27/0x50
>> [ 110.422846] ? __x64_sys_openat+0x20/0x30
>> [ 110.422853] ? do_syscall_64+0x69/0xc0
>> [ 110.422857] ? do_syscall_64+0x69/0xc0
>> [ 110.422862] ? irqentry_exit+0x1d/0x30
>> [ 110.422868] ? exc_page_fault+0x89/0x170
>> [ 110.422874] entry_SYSCALL_64_after_hwframe+0x61/0xcb
>> [ 110.422885] RIP: 0033:0x7f1576682a6b
>> [ 110.422892] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48
>> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00
>> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64
>> 89 01 48
>> [ 110.422897] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX:
>> 00000000000000b0
>> [ 110.422904] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX:
>> 00007f1576682a6b
>> [ 110.422908] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
>> 000056347ba575b8
>> [ 110.422911] RBP: 000056347ba57550 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 110.422913] R10: 00007f15766feac0 R11: 0000000000000206 R12:
>> 000056347ba575b8
>> [ 110.422916] R13: 0000000000000000 R14: 000056347ba575b8 R15:
>> 000056347ba57550
>> [ 110.422921] </TASK>
>> [ 110.425941] [drm] amdgpu: ttm finalized
>> [ 110.489186] amdgpu 0000:83:00.0: amdgpu: amdgpu: finishing device.
>> [ 110.504025] [drm] free PSP TMR buffer
>> [ 110.762272] amdgpu: [dbg_xgmi_hive_put] ref_count 3
>> [ 110.762280] CPU: 27 PID: 1748 Comm: modprobe Tainted: G
>> OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 110.762288] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 110.762290] Call Trace:
>> [ 110.762294] <TASK>
>> [ 110.762298] dump_stack_lvl+0x4a/0x63
>> [ 110.762313] dump_stack+0x10/0x16
>> [ 110.762319] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
>> [ 110.762663] amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
>> [ 110.762965] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
>> [ 110.763231] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>> [ 110.763519] drm_dev_release+0x28/0x50 [drm]
>> [ 110.763569] devm_drm_dev_init_release+0x38/0x60 [drm]
>> [ 110.763609] devm_action_release+0x15/0x20
>> [ 110.763617] release_nodes+0x40/0xb0
>> [ 110.763624] devres_release_all+0x9e/0xe0
>> [ 110.763631] device_release_driver_internal+0x117/0x1f0
>> [ 110.763636] driver_detach+0x4c/0xa0
>> [ 110.763640] bus_remove_driver+0x6c/0xf0
>> [ 110.763646] driver_unregister+0x31/0x50
>> [ 110.763650] pci_unregister_driver+0x40/0x90
>> [ 110.763657] amdgpu_exit+0x15/0x446 [amdgpu]
>> [ 110.764153] __x64_sys_delete_module+0x14e/0x260
>> [ 110.764164] ? do_syscall_64+0x69/0xc0
>> [ 110.764172] ? __x64_sys_read+0x1a/0x20
>> [ 110.764180] ? do_syscall_64+0x69/0xc0
>> [ 110.764184] ? ksys_read+0x67/0xf0
>> [ 110.764189] do_syscall_64+0x5c/0xc0
>> [ 110.764193] ? __x64_sys_read+0x1a/0x20
>> [ 110.764197] ? do_syscall_64+0x69/0xc0
>> [ 110.764202] ? syscall_exit_to_user_mode+0x27/0x50
>> [ 110.764209] ? __x64_sys_openat+0x20/0x30
>> [ 110.764217] ? do_syscall_64+0x69/0xc0
>> [ 110.764221] ? do_syscall_64+0x69/0xc0
>> [ 110.764226] ? irqentry_exit+0x1d/0x30
>> [ 110.764232] ? exc_page_fault+0x89/0x170
>> [ 110.764238] entry_SYSCALL_64_after_hwframe+0x61/0xcb
>> [ 110.764248] RIP: 0033:0x7f1576682a6b
>> [ 110.764255] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48
>> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00
>> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64
>> 89 01 48
>> [ 110.764260] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX:
>> 00000000000000b0
>> [ 110.764267] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX:
>> 00007f1576682a6b
>> [ 110.764270] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
>> 000056347ba575b8
>> [ 110.764273] RBP: 000056347ba57550 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 110.764275] R10: 00007f15766feac0 R11: 0000000000000206 R12:
>> 000056347ba575b8
>> [ 110.764278] R13: 0000000000000000 R14: 000056347ba575b8 R15:
>> 000056347ba57550
>> [ 110.764283] </TASK>
>> [ 110.764326] amdgpu: [dbg_xgmi_hive_put] ref_count 2
>> [ 110.764329] CPU: 27 PID: 1748 Comm: modprobe Tainted: G
>> OE 5.15.0-46-generic #49~20.04.1-Ubuntu
>> [ 110.764334] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
>> 09/14/2018
>> [ 110.764336] Call Trace:
>> [ 110.764337] <TASK>
>> [ 110.764339] dump_stack_lvl+0x4a/0x63
>> [ 110.764347] dump_stack+0x10/0x16
>> [ 110.764354] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
>> [ 110.764624] amdgpu_xgmi_remove_device+0x1ad/0x1c0 [amdgpu]
>> [ 110.764791] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
>> [ 110.764937] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>> [ 110.765085] drm_dev_release+0x28/0x50 [drm]
>> [ 110.765108] devm_drm_dev_init_release+0x38/0x60 [drm]
>> [ 110.765130] devm_action_release+0x15/0x20
>> [ 110.765134] release_nodes+0x40/0xb0
>> [ 110.765137] devres_release_all+0x9e/0xe0
>> [ 110.765141] device_release_driver_internal+0x117/0x1f0
>> [ 110.765144] driver_detach+0x4c/0xa0
>> [ 110.765146] bus_remove_driver+0x6c/0xf0
>> [ 110.765148] driver_unregister+0x31/0x50
>> [ 110.765150] pci_unregister_driver+0x40/0x90
>> [ 110.765154] amdgpu_exit+0x15/0x446 [amdgpu]
>> [ 110.765434] __x64_sys_delete_module+0x14e/0x260
>> [ 110.765438] ? do_syscall_64+0x69/0xc0
>> [ 110.765441] ? __x64_sys_read+0x1a/0x20
>> [ 110.765444] ? do_syscall_64+0x69/0xc0
>> [ 110.765446] ? ksys_read+0x67/0xf0
>> [ 110.765449] do_syscall_64+0x5c/0xc0
>> [ 110.765451] ? __x64_sys_read+0x1a/0x20
>> [ 110.765454] ? do_syscall_64+0x69/0xc0
>> [ 110.765456] ? syscall_exit_to_user_mode+0x27/0x50
>> [ 110.765460] ? __x64_sys_openat+0x20/0x30
>> [ 110.765464] ? do_syscall_64+0x69/0xc0
>> [ 110.765466] ? do_syscall_64+0x69/0xc0
>> [ 110.765469] ? irqentry_exit+0x1d/0x30
>> [ 110.765472] ? exc_page_fault+0x89/0x170
>> [ 110.765476] entry_SYSCALL_64_after_hwframe+0x61/0xcb
>> [ 110.765480] RIP: 0033:0x7f1576682a6b
>> [ 110.765482] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
>> [ 110.765485] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
>> [ 110.765488] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
>> [ 110.765489] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
>> [ 110.765491] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
>> [ 110.765492] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
>> [ 110.765494] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
>> [ 110.765496] </TASK>
>> [ 110.768091] [drm] amdgpu: ttm finalized
>>
>>> -----Original Message-----
>>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>>> Sent: August 11, 2022 12:43 PM
>>> To: Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak
>>>
>>>
>>> On 2022-08-11 11:34, Kim, Jonathan wrote:
>>>> [Public]
>>>>
>>>>> -----Original Message-----
>>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>>> Sent: August 11, 2022 11:19 AM
>>>>> To: amd-gfx at lists.freedesktop.org; Kim, Jonathan <Jonathan.Kim at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak
>>>>>
>>>>> Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
>>>>>> When an xgmi node is added to the hive, it takes another hive
>>>>>> reference for its reset domain.
>>>>>>
>>>>>> This extra reference was not dropped on device removal from the
>>>>>> hive so drop it.
>>>>>>
>>>>>> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
>>>>>> 1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>>>>> index 1b108d03e785..560bf1c98f08 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>>>>> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
>>>>>> mutex_unlock(&hive->hive_lock);
>>>>>>
>>>>>> amdgpu_put_xgmi_hive(hive);
>>>>>> + /* device is removed from the hive so remove its reset domain reference */
>>>>>> + if (adev->reset_domain && adev->reset_domain == hive->reset_domain)
>>>>>> + amdgpu_put_xgmi_hive(hive);
>>>>> This is some messed up reference counting. If you need an extra
>>>>> reference from the reset_domain to the hive, that should be owned by
>>>>> the reset_domain and dropped when the reset_domain is destroyed. And
>>>>> it's only one reference for the reset_domain, not one reference per
>>>>> adev in the reset_domain.
>>>> Cc'ing Andrey.
>>>>
>>>> What you're saying seems to make more sense to me, but what I got
>>>> from an offline conversation with Andrey was that the reset domain
>>>> reference per device was intentional.
>>>> Maybe Andrey can comment here.
>>>>
>>>>> What you're doing here looks like every adev that's in a
>>>>> reset_domain of its hive has two references to the hive. And if
>>>>> you're dropping the extra reference here, it still leaves the
>>>>> reset_domain with a dangling pointer to a hive that may no longer
>>>>> exist. So this extra reference is kind of pointless.
>>>
>>> reset_domain doesn't have any references to the hive, the hive has a
>>> reference to reset_domain
>>>
>>>
>>>> Yes. Currently one reference is fetched from the device's lifetime
>>>> on the hive and the other is from the per-device reset domain.
>>>>
>>>> Snippet from amdgpu_device_ip_init:
>>>> /**
>>>> * In case of XGMI grab extra reference for reset domain for this device
>>>> */
>>>> if (adev->gmc.xgmi.num_physical_nodes > 1) {
>>>> if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here
>>>
>>>
>>> amdgpu_xgmi_add_device calls amdgpu_get_xgmi_hive. Only on the first
>>> call to amdgpu_get_xgmi_hive, when the hive is actually allocated and
>>> initialized, do we proceed to creating the reset domain, either from
>>> scratch (first creation of the hive) or by taking a reference from
>>> adev (see [1]).
>>>
>>> [1] -
>>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c#L394
>>>
>>>
>>>> struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again
>>>
>>>
>>> So here I don't see how an extra reference to reset_domain is taken
>>> if amdgpu_get_xgmi_hive returns early because the hive was already
>>> created and exists in the global hive container.
>>>
>>> Jonathan, can you please show the exact flow by which the refcount
>>> leak on reset_domain happens?
>>>
>>> Andrey
>>>
>>>
>>>> if (!hive->reset_domain ||
>>>> !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
>>>> r = -ENOENT;
>>>> goto init_failed;
>>>> }
>>>>
>>>> /* Drop the early temporary reset domain we created for device */
>>>> amdgpu_reset_put_reset_domain(adev->reset_domain);
>>>> adev->reset_domain = hive->reset_domain;
>>>> }
>>>> }
>>>>
>>>> One of these never gets dropped so a leak happens.
>>>> So either the extra reference has to be dropped on device removal
>>>> from the hive or, from what you've mentioned, the reset_domain
>>>> reference fetch should be fixed to grab at the hive/reset_domain
>>>> level.
>>>> Thanks,
>>>>
>>>> Jon
>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>>
>>>>>> adev->hive = NULL;
>>>>>>
>>>>>> if (atomic_dec_return(&hive->number_devices) == 0) {
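
For illustration only, here is a minimal user-space sketch of the imbalance Jon describes in the quoted text: two hive references taken per device on load, one dropped on unload. It is not amdgpu code. The `toy_*` names are hypothetical, a plain int stands in for the hive's kobject refcount, and whether the driver really takes the second reference on every device is exactly what is being questioned in this thread.

```c
#include <assert.h>

/* Hypothetical stand-in for the hive refcount; not struct amdgpu_hive_info. */
struct toy_hive {
	int refcount;
};

static void toy_hive_get(struct toy_hive *h) { h->refcount++; }
static void toy_hive_put(struct toy_hive *h) { h->refcount--; }

/*
 * Load path as described by Jon: one get when the device is added to
 * the hive, one more get during device init for the reset domain.
 */
static void toy_device_load(struct toy_hive *h)
{
	toy_hive_get(h);	/* device added to hive */
	toy_hive_get(h);	/* extra reference for the reset domain */
}

/*
 * Unload path: one put on removal from the hive. drop_extra models the
 * additional put proposed in the patch under review.
 */
static void toy_device_unload(struct toy_hive *h, int drop_extra)
{
	toy_hive_put(h);
	if (drop_extra)
		toy_hive_put(h);
}

/* Load and unload num_devices devices; return the references left over. */
int toy_leaked_refs(int num_devices, int drop_extra)
{
	struct toy_hive h = { 0 };
	int i;

	assert(num_devices >= 0);

	for (i = 0; i < num_devices; i++)
		toy_device_load(&h);
	for (i = 0; i < num_devices; i++)
		toy_device_unload(&h, drop_extra);

	return h.refcount;
}
```

Under these assumptions, toy_leaked_refs(2, 0) returns 2 and toy_leaked_refs(8, 0) returns 8, which lines up with the counts Jon reported for the 2 GPU and 8 GPU systems, and toy_leaked_refs(2, 1) returns 0. The sketch only balances the count; it does not address the ownership question of which object should hold the extra reference.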