[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak
Kim, Jonathan
Jonathan.Kim at amd.com
Fri Aug 12 18:38:36 UTC 2022
[Public]
Hi Andrey,
Here's the load/unload stack trace from a 2-GPU xGMI system. I added dbg_xgmi_hive_get/put prints that report the refcount right after each kobject get/put.
The refcount is stuck at 2 after unload. On an 8-GPU system, it's stuck at 8.
Example of the leaked sysfs entry after driver unload:
atitest at atitest:/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:00.0/0000:83:00.0$ ls xgmi_hive_info/
xgmi_hive_id
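To make the arithmetic behind those numbers explicit, here is a rough model of the counts in the trace below. It is only a sketch, not driver code: model_hive_refcount and its drop_reset_domain_ref flag are made-up names, and the initial reference of 1 at hive creation is inferred from the first get printing ref_count 2. The paired get/put in amdgpu_xgmi_set_pstate nets to zero and is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the hive refcount seen in the trace; hypothetical names,
 * not the amdgpu implementation.
 */
int model_hive_refcount(int num_devices, bool drop_reset_domain_ref)
{
    int refcount = 0;
    int i;

    assert(num_devices > 0);

    /* Load path: devices are probed one after another. */
    for (i = 0; i < num_devices; i++) {
        if (i == 0)
            refcount = 1;   /* assumed initial reference from hive creation */
        refcount++;         /* get via amdgpu_xgmi_add_device */
        refcount++;         /* second get during device init (reset domain) */
    }

    /* Unload path: devices are removed one after another. */
    for (i = num_devices - 1; i >= 0; i--) {
        refcount--;         /* put in amdgpu_xgmi_remove_device */
        if (drop_reset_domain_ref)
            refcount--;     /* extra put proposed by the patch */
        if (i == 0)
            refcount--;     /* last device out drops the creation reference */
    }

    return refcount;
}
```

With drop_reset_domain_ref set to false the model returns 2 for 2 devices and 8 for 8 devices, which matches what I see. With it set to true (the extra put from the patch) it returns 0.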
Thanks,
Jon
Driver load (a hive reference is taken both when the device is added to the hive and again during per-device init):
[ 61.975900] amdkcl: loading out-of-tree module taints kernel.
[ 61.975973] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[ 62.065546] amdkcl: Warning: fail to get symbol cancel_work, replace it with kcl stub
[ 62.081920] AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.
[ 62.491119] [drm] amdgpu kernel modesetting enabled.
[ 62.491122] [drm] amdgpu version: 5.18.2
[ 62.491124] [drm] OS DRM version: 5.15.0
[ 62.491337] amdgpu: CRAT table not found
[ 62.491341] amdgpu: Virtual CRAT table created for CPU
[ 62.491360] amdgpu: Topology: Add CPU node
[ 62.603556] amdgpu: PeerDirect support was initialized successfully
[ 62.603847] amdgpu 0000:83:00.0: enabling device (0100 -> 0102)
[ 62.603987] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00).
[ 62.604023] [drm] register mmio base: 0xFBD00000
[ 62.604026] [drm] register mmio size: 524288
[ 62.604171] [drm] add ip block number 0 <soc15_common>
[ 62.604175] [drm] add ip block number 1 <gmc_v9_0>
[ 62.604177] [drm] add ip block number 2 <vega20_ih>
[ 62.604180] [drm] add ip block number 3 <psp>
[ 62.604182] [drm] add ip block number 4 <powerplay>
[ 62.604185] [drm] add ip block number 5 <dm>
[ 62.604187] [drm] add ip block number 6 <gfx_v9_0>
[ 62.604190] [drm] add ip block number 7 <sdma_v4_0>
[ 62.604192] [drm] add ip block number 8 <uvd_v7_0>
[ 62.604194] [drm] add ip block number 9 <vce_v4_0>
[ 62.641771] amdgpu 0000:83:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 62.641777] amdgpu: ATOM BIOS: 113-D1630200-112
[ 62.713418] [drm] UVD(0) is enabled in VM mode
[ 62.713423] [drm] UVD(1) is enabled in VM mode
[ 62.713426] [drm] UVD(0) ENC is enabled in VM mode
[ 62.713428] [drm] UVD(1) ENC is enabled in VM mode
[ 62.713430] [drm] VCE enabled in VM mode
[ 62.713433] amdgpu 0000:83:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 62.713472] [drm] GPU posting now...
[ 62.713993] amdgpu 0000:83:00.0: amdgpu: MEM ECC is active.
[ 62.713995] amdgpu 0000:83:00.0: amdgpu: SRAM ECC is active.
[ 62.714006] amdgpu 0000:83:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[ 62.714018] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 62.714026] amdgpu 0000:83:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[ 62.714029] amdgpu 0000:83:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 62.714032] amdgpu 0000:83:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF
[ 62.714043] [drm] Detected VRAM RAM=32752M, BAR=32768M
[ 62.714044] [drm] RAM width 4096bits HBM
[ 62.714050] debugfs: Directory 'ttm' with parent '/' already present!
[ 62.714146] [drm] amdgpu: 32752M of VRAM memory ready
[ 62.714149] [drm] amdgpu: 40203M of GTT memory ready.
[ 62.714170] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 62.714266] [drm] PCIE GART of 512M enabled.
[ 62.714267] [drm] PTB located at 0x0000008000000000
[ 62.731067] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist
[ 62.731075] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist
[ 62.731449] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
[ 62.743177] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[ 62.743244] [drm] PSP loading UVD firmware
[ 62.744525] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[ 62.744689] [drm] PSP loading VCE firmware
[ 62.896804] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[ 62.979421] amdgpu 0000:83:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[ 62.979427] amdgpu 0000:83:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[ 62.979430] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 62.979432] amdgpu 0000:83:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 62.982386] [drm] Display Core initialized with v3.2.196!
[ 62.984514] [drm] kiq ring mec 2 pipe 1 q 0
[ 63.026846] [drm] UVD and UVD ENC initialized successfully.
[ 63.225760] [drm] VCE initialized successfully.
[ 63.244442] amdgpu: [dbg_xgmi_hive_get] ref_count 2
[ 63.244448] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 63.244454] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 63.244457] Workqueue: events work_for_cpu_fn
[ 63.244471] Call Trace:
[ 63.244474] <TASK>
[ 63.244479] dump_stack_lvl+0x4a/0x63
[ 63.244493] dump_stack+0x10/0x16
[ 63.244501] amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
[ 63.245047] amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
[ 63.245463] ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
[ 63.245879] ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
[ 63.246466] amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
[ 63.247055] ? pci_bus_read_config_word+0x4a/0x70
[ 63.247064] ? do_pci_enable_device+0xdb/0x110
[ 63.247070] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 63.247463] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 63.247868] local_pci_probe+0x4b/0x90
[ 63.247876] work_for_cpu_fn+0x1a/0x30
[ 63.247881] process_one_work+0x22b/0x3d0
[ 63.247887] worker_thread+0x21d/0x3f0
[ 63.247893] ? process_one_work+0x3d0/0x3d0
[ 63.247898] kthread+0x12a/0x150
[ 63.247905] ? set_kthread_struct+0x50/0x50
[ 63.247910] ret_from_fork+0x22/0x30
[ 63.247922] </TASK>
[ 63.248563] amdgpu 0000:83:00.0: amdgpu: XGMI: Add node 0, hive 0x25bbae7e3fd04cf4.
[ 63.248569] amdgpu: [dbg_xgmi_hive_get] ref_count 3
[ 63.248572] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 63.248578] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 63.248580] Workqueue: events work_for_cpu_fn
[ 63.248587] Call Trace:
[ 63.248588] <TASK>
[ 63.248590] dump_stack_lvl+0x4a/0x63
[ 63.248598] dump_stack+0x10/0x16
[ 63.248604] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[ 63.249033] amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
[ 63.249621] ? pci_bus_read_config_word+0x4a/0x70
[ 63.249627] ? do_pci_enable_device+0xdb/0x110
[ 63.249632] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 63.250022] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 63.250410] local_pci_probe+0x4b/0x90
[ 63.250416] work_for_cpu_fn+0x1a/0x30
[ 63.250421] process_one_work+0x22b/0x3d0
[ 63.250428] worker_thread+0x21d/0x3f0
[ 63.250434] ? process_one_work+0x3d0/0x3d0
[ 63.250440] kthread+0x12a/0x150
[ 63.250445] ? set_kthread_struct+0x50/0x50
[ 63.250450] ret_from_fork+0x22/0x30
[ 63.250458] </TASK>
[ 63.268869] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 63.269180] amdgpu: sdma_bitmap: ffff
[ 63.605188] memmap_init_zone_device initialised 8388608 pages in 132ms
[ 63.605203] amdgpu: HMM registered 32752MB device memory
[ 63.605244] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
[ 63.605263] amdgpu: Virtual CRAT table created for GPU
[ 63.605651] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
[ 63.605659] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
[ 63.605670] kfd kfd: amdgpu: added device 1002:66a1
[ 63.626300] amdgpu 0000:83:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
[ 63.626517] amdgpu 0000:83:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[ 63.626522] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 63.626525] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 63.626529] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 63.626531] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 63.626534] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 63.626537] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 63.626540] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 63.626543] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 63.626546] amdgpu 0000:83:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 63.626549] amdgpu 0000:83:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[ 63.626552] amdgpu 0000:83:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
[ 63.626555] amdgpu 0000:83:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
[ 63.626558] amdgpu 0000:83:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
[ 63.626561] amdgpu 0000:83:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
[ 63.626563] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[ 63.626566] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[ 63.626569] amdgpu 0000:83:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
[ 63.626572] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[ 63.626575] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[ 63.626577] amdgpu 0000:83:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
[ 63.626580] amdgpu 0000:83:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
[ 63.626583] amdgpu 0000:83:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
[ 63.636996] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
[ 63.637046] amdgpu: Detected AMDGPU 2 Perf Events.
[ 63.637428] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:83:00.0 on minor 1
[ 63.637937] amdgpu 0000:86:00.0: enabling device (0100 -> 0102)
[ 63.638043] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00).
[ 63.638090] [drm] register mmio base: 0xFBB00000
[ 63.638092] [drm] register mmio size: 524288
[ 63.638261] [drm] add ip block number 0 <soc15_common>
[ 63.638263] [drm] add ip block number 1 <gmc_v9_0>
[ 63.638265] [drm] add ip block number 2 <vega20_ih>
[ 63.638266] [drm] add ip block number 3 <psp>
[ 63.638267] [drm] add ip block number 4 <powerplay>
[ 63.638269] [drm] add ip block number 5 <dm>
[ 63.638271] [drm] add ip block number 6 <gfx_v9_0>
[ 63.638272] [drm] add ip block number 7 <sdma_v4_0>
[ 63.638273] [drm] add ip block number 8 <uvd_v7_0>
[ 63.638275] [drm] add ip block number 9 <vce_v4_0>
[ 63.675838] amdgpu 0000:86:00.0: amdgpu: Fetched VBIOS from ROM BAR
[ 63.675842] amdgpu: ATOM BIOS: 113-D1630200-112
[ 63.675867] [drm] UVD(0) is enabled in VM mode
[ 63.675868] [drm] UVD(1) is enabled in VM mode
[ 63.675869] [drm] UVD(0) ENC is enabled in VM mode
[ 63.675870] [drm] UVD(1) ENC is enabled in VM mode
[ 63.675871] [drm] VCE enabled in VM mode
[ 63.675873] amdgpu 0000:86:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 63.675899] [drm] GPU posting now...
[ 63.676276] amdgpu 0000:86:00.0: amdgpu: MEM ECC is active.
[ 63.676277] amdgpu 0000:86:00.0: amdgpu: SRAM ECC is active.
[ 63.676286] amdgpu 0000:86:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[ 63.676297] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 63.676304] amdgpu 0000:86:00.0: amdgpu: VRAM: 32752M 0x0000008800000000 - 0x0000008FFEFFFFFF (32752M used)
[ 63.676307] amdgpu 0000:86:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 63.676310] amdgpu 0000:86:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF
[ 63.676321] [drm] Detected VRAM RAM=32752M, BAR=32768M
[ 63.676322] [drm] RAM width 4096bits HBM
[ 63.676363] [drm] amdgpu: 32752M of VRAM memory ready
[ 63.676365] [drm] amdgpu: 40203M of GTT memory ready.
[ 63.676388] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 63.676481] [drm] PCIE GART of 512M enabled.
[ 63.676482] [drm] PTB located at 0x0000008800000000
[ 63.676730] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist
[ 63.676733] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist
[ 63.677088] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
[ 63.678862] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[ 63.678918] [drm] PSP loading UVD firmware
[ 63.679487] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[ 63.679619] [drm] PSP loading VCE firmware
[ 63.831730] [drm] reserve 0x400000 from 0x8ffec00000 for PSP TMR
[ 63.914508] amdgpu 0000:86:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[ 63.914513] amdgpu 0000:86:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[ 63.914516] amdgpu 0000:86:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 63.914518] amdgpu 0000:86:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 63.917458] [drm] Display Core initialized with v3.2.196!
[ 63.919616] [drm] kiq ring mec 2 pipe 1 q 0
[ 63.961950] [drm] UVD and UVD ENC initialized successfully.
[ 64.160863] [drm] VCE initialized successfully.
[ 64.179285] amdgpu: [dbg_xgmi_hive_get] ref_count 4
[ 64.179291] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 64.179297] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 64.179299] Workqueue: events work_for_cpu_fn
[ 64.179311] Call Trace:
[ 64.179315] <TASK>
[ 64.179320] dump_stack_lvl+0x4a/0x63
[ 64.179331] dump_stack+0x10/0x16
[ 64.179340] amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
[ 64.179904] amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
[ 64.180318] ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
[ 64.180733] ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
[ 64.181321] amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
[ 64.181909] ? pci_bus_read_config_word+0x4a/0x70
[ 64.181917] ? do_pci_enable_device+0xdb/0x110
[ 64.181923] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 64.182315] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 64.182703] local_pci_probe+0x4b/0x90
[ 64.182710] work_for_cpu_fn+0x1a/0x30
[ 64.182715] process_one_work+0x22b/0x3d0
[ 64.182722] worker_thread+0x21d/0x3f0
[ 64.182728] ? process_one_work+0x3d0/0x3d0
[ 64.182734] kthread+0x12a/0x150
[ 64.182740] ? set_kthread_struct+0x50/0x50
[ 64.182745] ret_from_fork+0x22/0x30
[ 64.182756] </TASK>
[ 64.184561] amdgpu 0000:86:00.0: amdgpu: XGMI: Add node 1, hive 0x25bbae7e3fd04cf4.
[ 64.184568] amdgpu: [dbg_xgmi_hive_get] ref_count 5
[ 64.184571] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 64.184576] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 64.184578] Workqueue: events work_for_cpu_fn
[ 64.184585] Call Trace:
[ 64.184587] <TASK>
[ 64.184589] dump_stack_lvl+0x4a/0x63
[ 64.184596] dump_stack+0x10/0x16
[ 64.184602] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[ 64.185041] amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
[ 64.185624] ? pci_bus_read_config_word+0x4a/0x70
[ 64.185631] ? do_pci_enable_device+0xdb/0x110
[ 64.185636] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 64.186027] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 64.186416] local_pci_probe+0x4b/0x90
[ 64.186422] work_for_cpu_fn+0x1a/0x30
[ 64.186428] process_one_work+0x22b/0x3d0
[ 64.186434] worker_thread+0x21d/0x3f0
[ 64.186439] ? process_one_work+0x3d0/0x3d0
[ 64.186445] kthread+0x12a/0x150
[ 64.186450] ? set_kthread_struct+0x50/0x50
[ 64.186455] ret_from_fork+0x22/0x30
[ 64.186464] </TASK>
[ 64.206119] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 64.206433] amdgpu: sdma_bitmap: ffff
[ 64.552064] memmap_init_zone_device initialised 8388608 pages in 132ms
[ 64.552080] amdgpu: HMM registered 32752MB device memory
[ 64.552116] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
[ 64.552138] amdgpu: Virtual CRAT table created for GPU
[ 64.552978] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
[ 64.552988] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
[ 64.552999] kfd kfd: amdgpu: added device 1002:66a1
[ 64.570314] amdgpu 0000:86:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
[ 64.570527] amdgpu 0000:86:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[ 64.570531] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 64.570535] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 64.570538] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 64.570541] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 64.570544] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 64.570547] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 64.570550] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 64.570552] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 64.570556] amdgpu 0000:86:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 64.570559] amdgpu 0000:86:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[ 64.570562] amdgpu 0000:86:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
[ 64.570565] amdgpu 0000:86:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
[ 64.570567] amdgpu 0000:86:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
[ 64.570570] amdgpu 0000:86:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
[ 64.570573] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[ 64.570576] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[ 64.570579] amdgpu 0000:86:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
[ 64.570581] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[ 64.570584] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[ 64.570587] amdgpu 0000:86:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
[ 64.570589] amdgpu 0000:86:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
[ 64.570592] amdgpu 0000:86:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
[ 64.581070] amdgpu: [dbg_xgmi_hive_get] ref_count 6
[ 64.581075] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 64.581079] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 64.581081] Workqueue: events work_for_cpu_fn
[ 64.581089] Call Trace:
[ 64.581091] <TASK>
[ 64.581094] dump_stack_lvl+0x4a/0x63
[ 64.581103] dump_stack+0x10/0x16
[ 64.581109] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[ 64.581489] amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
[ 64.581723] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[ 64.581943] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[ 64.582288] ? pci_bus_read_config_word+0x4a/0x70
[ 64.582295] ? do_pci_enable_device+0xdb/0x110
[ 64.582298] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 64.582520] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 64.582738] local_pci_probe+0x4b/0x90
[ 64.582743] work_for_cpu_fn+0x1a/0x30
[ 64.582746] process_one_work+0x22b/0x3d0
[ 64.582750] worker_thread+0x21d/0x3f0
[ 64.582753] ? process_one_work+0x3d0/0x3d0
[ 64.582756] kthread+0x12a/0x150
[ 64.582761] ? set_kthread_struct+0x50/0x50
[ 64.582764] ret_from_fork+0x22/0x30
[ 64.582772] </TASK>
[ 64.582774] amdgpu: [dbg_xgmi_hive_put] ref_count 5
[ 64.582775] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 64.582778] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 64.582779] Workqueue: events work_for_cpu_fn
[ 64.582782] Call Trace:
[ 64.582783] <TASK>
[ 64.582784] dump_stack_lvl+0x4a/0x63
[ 64.582789] dump_stack+0x10/0x16
[ 64.582792] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[ 64.583028] amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
[ 64.583262] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[ 64.583482] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[ 64.583833] ? pci_bus_read_config_word+0x4a/0x70
[ 64.583836] ? do_pci_enable_device+0xdb/0x110
[ 64.583840] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 64.584072] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 64.584304] local_pci_probe+0x4b/0x90
[ 64.584307] work_for_cpu_fn+0x1a/0x30
[ 64.584311] process_one_work+0x22b/0x3d0
[ 64.584314] worker_thread+0x21d/0x3f0
[ 64.584318] ? process_one_work+0x3d0/0x3d0
[ 64.584321] kthread+0x12a/0x150
[ 64.584324] ? set_kthread_struct+0x50/0x50
[ 64.584327] ret_from_fork+0x22/0x30
[ 64.584333] </TASK>
[ 64.584342] amdgpu: [dbg_xgmi_hive_get] ref_count 6
[ 64.584344] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 64.584347] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 64.584348] Workqueue: events work_for_cpu_fn
[ 64.584352] Call Trace:
[ 64.584353] <TASK>
[ 64.584354] dump_stack_lvl+0x4a/0x63
[ 64.584358] dump_stack+0x10/0x16
[ 64.584362] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[ 64.584610] amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
[ 64.584856] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[ 64.585086] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[ 64.585437] ? pci_bus_read_config_word+0x4a/0x70
[ 64.585440] ? do_pci_enable_device+0xdb/0x110
[ 64.585443] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 64.585679] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 64.585922] local_pci_probe+0x4b/0x90
[ 64.585926] work_for_cpu_fn+0x1a/0x30
[ 64.585929] process_one_work+0x22b/0x3d0
[ 64.585932] worker_thread+0x21d/0x3f0
[ 64.585936] ? process_one_work+0x3d0/0x3d0
[ 64.585939] kthread+0x12a/0x150
[ 64.585942] ? set_kthread_struct+0x50/0x50
[ 64.585945] ret_from_fork+0x22/0x30
[ 64.585950] </TASK>
[ 64.585951] amdgpu: [dbg_xgmi_hive_put] ref_count 5
[ 64.585953] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 64.585956] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 64.585957] Workqueue: events work_for_cpu_fn
[ 64.585960] Call Trace:
[ 64.585961] <TASK>
[ 64.585963] dump_stack_lvl+0x4a/0x63
[ 64.585967] dump_stack+0x10/0x16
[ 64.585970] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[ 64.586213] amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
[ 64.586458] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[ 64.586688] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[ 64.587037] ? pci_bus_read_config_word+0x4a/0x70
[ 64.587040] ? do_pci_enable_device+0xdb/0x110
[ 64.587043] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[ 64.587277] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 64.587509] local_pci_probe+0x4b/0x90
[ 64.587512] work_for_cpu_fn+0x1a/0x30
[ 64.587515] process_one_work+0x22b/0x3d0
[ 64.587519] worker_thread+0x21d/0x3f0
[ 64.587523] ? process_one_work+0x3d0/0x3d0
[ 64.587526] kthread+0x12a/0x150
[ 64.587529] ? set_kthread_struct+0x50/0x50
[ 64.587532] ret_from_fork+0x22/0x30
[ 64.587537] </TASK>
[ 64.587619] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
[ 64.587663] amdgpu: Detected AMDGPU 2 Perf Events.
[ 64.588081] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:86:00.0 on minor 2
Then driver unload (reference stuck at 2):
[ 110.117018] amdgpu 0000:86:00.0: amdgpu: amdgpu: finishing device.
[ 110.131638] [drm] free PSP TMR buffer
[ 110.420529] amdgpu: [dbg_xgmi_hive_put] ref_count 4
[ 110.420537] CPU: 27 PID: 1748 Comm: modprobe Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 110.420545] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 110.420548] Call Trace:
[ 110.420551] <TASK>
[ 110.420556] dump_stack_lvl+0x4a/0x63
[ 110.420569] dump_stack+0x10/0x16
[ 110.420578] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[ 110.421001] amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
[ 110.421380] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
[ 110.421724] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 110.422070] drm_dev_release+0x28/0x50 [drm]
[ 110.422145] devm_drm_dev_init_release+0x38/0x60 [drm]
[ 110.422190] devm_action_release+0x15/0x20
[ 110.422198] release_nodes+0x40/0xb0
[ 110.422205] devres_release_all+0x9e/0xe0
[ 110.422212] device_release_driver_internal+0x117/0x1f0
[ 110.422218] driver_detach+0x4c/0xa0
[ 110.422222] bus_remove_driver+0x6c/0xf0
[ 110.422227] driver_unregister+0x31/0x50
[ 110.422231] pci_unregister_driver+0x40/0x90
[ 110.422238] amdgpu_exit+0x15/0x446 [amdgpu]
[ 110.422791] __x64_sys_delete_module+0x14e/0x260
[ 110.422801] ? do_syscall_64+0x69/0xc0
[ 110.422809] ? __x64_sys_read+0x1a/0x20
[ 110.422817] ? do_syscall_64+0x69/0xc0
[ 110.422821] ? ksys_read+0x67/0xf0
[ 110.422825] do_syscall_64+0x5c/0xc0
[ 110.422830] ? __x64_sys_read+0x1a/0x20
[ 110.422834] ? do_syscall_64+0x69/0xc0
[ 110.422839] ? syscall_exit_to_user_mode+0x27/0x50
[ 110.422846] ? __x64_sys_openat+0x20/0x30
[ 110.422853] ? do_syscall_64+0x69/0xc0
[ 110.422857] ? do_syscall_64+0x69/0xc0
[ 110.422862] ? irqentry_exit+0x1d/0x30
[ 110.422868] ? exc_page_fault+0x89/0x170
[ 110.422874] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 110.422885] RIP: 0033:0x7f1576682a6b
[ 110.422892] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
[ 110.422897] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 110.422904] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
[ 110.422908] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
[ 110.422911] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
[ 110.422913] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
[ 110.422916] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
[ 110.422921] </TASK>
[ 110.425941] [drm] amdgpu: ttm finalized
[ 110.489186] amdgpu 0000:83:00.0: amdgpu: amdgpu: finishing device.
[ 110.504025] [drm] free PSP TMR buffer
[ 110.762272] amdgpu: [dbg_xgmi_hive_put] ref_count 3
[ 110.762280] CPU: 27 PID: 1748 Comm: modprobe Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 110.762288] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 110.762290] Call Trace:
[ 110.762294] <TASK>
[ 110.762298] dump_stack_lvl+0x4a/0x63
[ 110.762313] dump_stack+0x10/0x16
[ 110.762319] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[ 110.762663] amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
[ 110.762965] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
[ 110.763231] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 110.763519] drm_dev_release+0x28/0x50 [drm]
[ 110.763569] devm_drm_dev_init_release+0x38/0x60 [drm]
[ 110.763609] devm_action_release+0x15/0x20
[ 110.763617] release_nodes+0x40/0xb0
[ 110.763624] devres_release_all+0x9e/0xe0
[ 110.763631] device_release_driver_internal+0x117/0x1f0
[ 110.763636] driver_detach+0x4c/0xa0
[ 110.763640] bus_remove_driver+0x6c/0xf0
[ 110.763646] driver_unregister+0x31/0x50
[ 110.763650] pci_unregister_driver+0x40/0x90
[ 110.763657] amdgpu_exit+0x15/0x446 [amdgpu]
[ 110.764153] __x64_sys_delete_module+0x14e/0x260
[ 110.764164] ? do_syscall_64+0x69/0xc0
[ 110.764172] ? __x64_sys_read+0x1a/0x20
[ 110.764180] ? do_syscall_64+0x69/0xc0
[ 110.764184] ? ksys_read+0x67/0xf0
[ 110.764189] do_syscall_64+0x5c/0xc0
[ 110.764193] ? __x64_sys_read+0x1a/0x20
[ 110.764197] ? do_syscall_64+0x69/0xc0
[ 110.764202] ? syscall_exit_to_user_mode+0x27/0x50
[ 110.764209] ? __x64_sys_openat+0x20/0x30
[ 110.764217] ? do_syscall_64+0x69/0xc0
[ 110.764221] ? do_syscall_64+0x69/0xc0
[ 110.764226] ? irqentry_exit+0x1d/0x30
[ 110.764232] ? exc_page_fault+0x89/0x170
[ 110.764238] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 110.764248] RIP: 0033:0x7f1576682a6b
[ 110.764255] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
[ 110.764260] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 110.764267] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
[ 110.764270] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
[ 110.764273] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
[ 110.764275] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
[ 110.764278] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
[ 110.764283] </TASK>
[ 110.764326] amdgpu: [dbg_xgmi_hive_put] ref_count 2
[ 110.764329] CPU: 27 PID: 1748 Comm: modprobe Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu
[ 110.764334] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[ 110.764336] Call Trace:
[ 110.764337] <TASK>
[ 110.764339] dump_stack_lvl+0x4a/0x63
[ 110.764347] dump_stack+0x10/0x16
[ 110.764354] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[ 110.764624] amdgpu_xgmi_remove_device+0x1ad/0x1c0 [amdgpu]
[ 110.764791] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
[ 110.764937] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 110.765085] drm_dev_release+0x28/0x50 [drm]
[ 110.765108] devm_drm_dev_init_release+0x38/0x60 [drm]
[ 110.765130] devm_action_release+0x15/0x20
[ 110.765134] release_nodes+0x40/0xb0
[ 110.765137] devres_release_all+0x9e/0xe0
[ 110.765141] device_release_driver_internal+0x117/0x1f0
[ 110.765144] driver_detach+0x4c/0xa0
[ 110.765146] bus_remove_driver+0x6c/0xf0
[ 110.765148] driver_unregister+0x31/0x50
[ 110.765150] pci_unregister_driver+0x40/0x90
[ 110.765154] amdgpu_exit+0x15/0x446 [amdgpu]
[ 110.765434] __x64_sys_delete_module+0x14e/0x260
[ 110.765438] ? do_syscall_64+0x69/0xc0
[ 110.765441] ? __x64_sys_read+0x1a/0x20
[ 110.765444] ? do_syscall_64+0x69/0xc0
[ 110.765446] ? ksys_read+0x67/0xf0
[ 110.765449] do_syscall_64+0x5c/0xc0
[ 110.765451] ? __x64_sys_read+0x1a/0x20
[ 110.765454] ? do_syscall_64+0x69/0xc0
[ 110.765456] ? syscall_exit_to_user_mode+0x27/0x50
[ 110.765460] ? __x64_sys_openat+0x20/0x30
[ 110.765464] ? do_syscall_64+0x69/0xc0
[ 110.765466] ? do_syscall_64+0x69/0xc0
[ 110.765469] ? irqentry_exit+0x1d/0x30
[ 110.765472] ? exc_page_fault+0x89/0x170
[ 110.765476] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 110.765480] RIP: 0033:0x7f1576682a6b
[ 110.765482] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
[ 110.765485] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 110.765488] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
[ 110.765489] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
[ 110.765491] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
[ 110.765492] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
[ 110.765494] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
[ 110.765496] </TASK>
[ 110.768091] [drm] amdgpu: ttm finalized
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: August 11, 2022 12:43 PM
> To: Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix
> <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
> leak
>
>
> On 2022-08-11 11:34, Kim, Jonathan wrote:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> >> Sent: August 11, 2022 11:19 AM
> >> To: amd-gfx at lists.freedesktop.org; Kim, Jonathan
> <Jonathan.Kim at amd.com>
> >> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
> >> leak
> >>
> >> On 2022-08-11 at 09:42, Jonathan Kim wrote:
> >>> When an xgmi node is added to the hive, it takes another hive
> >>> reference for its reset domain.
> >>>
> >>> This extra reference was not dropped on device removal from the
> >>> hive so drop it.
> >>>
> >>> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> >>> ---
> >>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
> >>> 1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>> index 1b108d03e785..560bf1c98f08 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
> >>> mutex_unlock(&hive->hive_lock);
> >>>
> >>> amdgpu_put_xgmi_hive(hive);
> >>> + /* device is removed from the hive so remove its reset domain reference */
> >>> + if (adev->reset_domain && adev->reset_domain == hive->reset_domain)
> >>> + amdgpu_put_xgmi_hive(hive);
> >> This is some messed up reference counting. If you need an extra
> >> reference from the reset_domain to the hive, that should be owned by the
> >> reset_domain and dropped when the reset_domain is destroyed. And it's
> >> only one reference for the reset_domain, not one reference per adev in
> >> the reset_domain.
> > Cc'ing Andrey.
> >
> > What you're saying seems to make more sense to me, but what I got from an
> > offline conversation with Andrey was that the reset domain reference per
> > device was intentional.
> > Maybe Andrey can comment here.
> >
> >> What you're doing here looks like every adev that's in a reset_domain of
> >> its hive has two references to the hive. And if you're dropping the
> >> extra reference here, it still leaves the reset_domain with a dangling
> >> pointer to a hive that may no longer exist. So this extra reference is
> >> kind of pointless.
>
>
> reset_domain doesn't have any references to the hive, the hive has a
> reference to reset_domain
>
>
> > Yes. Currently one reference is taken for the device's lifetime on the hive
> > and the other is taken for the per-device reset domain.
> >
> > Snippet from amdgpu_device_ip_init:
> > /**
> > * In case of XGMI grab extra reference for reset domain for this device
> > */
> > if (adev->gmc.xgmi.num_physical_nodes > 1) {
> > if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here
>
>
> amdgpu_xgmi_add_device calls amdgpu_get_xgmi_hive, and only on the first
> call, when the hive is actually allocated and initialized, do we proceed
> to create the reset domain, either from scratch (first creation of the
> hive) or by taking a reference from adev (see [1]).
>
>
>
> [1] -
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c#L394
>
> > struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again
>
>
> So here I don't see how an extra reference to reset_domain is taken if
> amdgpu_get_xgmi_hive returns early because the hive has already been
> created and exists in the global hive container.
>
> Jonathan - can you please show the exact flow by which the refcount leak
> on reset_domain happens?
>
> Andrey
>
>
> >
> > if (!hive->reset_domain ||
> > !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
> > r = -ENOENT;
> > goto init_failed;
> > }
> >
> > /* Drop the early temporary reset domain we created for device */
> > amdgpu_reset_put_reset_domain(adev->reset_domain);
> > adev->reset_domain = hive->reset_domain;
> > }
> > }
> >
> > One of these never gets dropped so a leak happens.
> > So either the extra reference has to be dropped on device removal from the
> > hive or, as you've mentioned, the reset_domain reference should instead be
> > taken at the hive/reset_domain level.
> >
> > Thanks,
> >
> > Jon
> >
> >> Regards,
> >> Felix
> >>
> >>
> >>> adev->hive = NULL;
> >>>
> >>> if (atomic_dec_return(&hive->number_devices) == 0) {
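
For comparison, the alternative Felix describes in the quoted review (a single hive reference owned by the reset domain and dropped when the domain is destroyed, rather than one per adev) could be modeled roughly as below. This is a sketch under assumptions: the toy_* types and functions are hypothetical stand-ins, not the amdgpu structures, and Andrey's reply notes that in the current code it is the hive that references the reset domain, not the other way around.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the amdgpu structures. */
struct toy_hive {
    int refcount;
};

struct toy_reset_domain {
    int refcount;
    struct toy_hive *hive;  /* the single hive reference owned by the domain */
};

void toy_domain_init(struct toy_reset_domain *d, struct toy_hive *h)
{
    d->refcount = 1;        /* creator's reference on the domain */
    d->hive = h;
    h->refcount++;          /* hive reference taken once, at domain creation */
}

void toy_domain_get(struct toy_reset_domain *d)
{
    d->refcount++;          /* devices reference the domain, not the hive */
}

void toy_domain_put(struct toy_reset_domain *d)
{
    assert(d->refcount > 0);
    if (--d->refcount == 0) {
        d->hive->refcount--;    /* dropped once, at domain destruction */
        d->hive = NULL;
    }
}

/* Run one load/unload cycle and return the hive refcount left over. */
int toy_hive_refs_after_cycle(int num_devices)
{
    struct toy_hive hive = { .refcount = 1 };   /* hive's own creation ref */
    struct toy_reset_domain domain;
    int i;

    toy_domain_init(&domain, &hive);
    for (i = 0; i < num_devices; i++)
        toy_domain_get(&domain);
    for (i = 0; i < num_devices; i++)
        toy_domain_put(&domain);
    toy_domain_put(&domain);    /* creator drops its reference last */

    return hive.refcount;
}
```

In this model the hive count left after a full cycle is the hive's own creation reference regardless of how many devices joined, so nothing scales with the device count the way the leak above does.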