[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

Andrey Grodzovsky andrey.grodzovsky at amd.com
Fri Aug 12 22:05:12 UTC 2022


On 2022-08-12 14:38, Kim, Jonathan wrote:
> [Public]
>
> Hi Andrey,
>
> Here's the load/unload stack trace.  This is a 2 GPU xGMI system.  I put dbg_xgmi_hive_get/put refcount print post kobj get/put.
> It's stuck at 2 on unload.  If it's an 8 GPU system, it's stuck at 8.
>
> e.g. of sysfs leak after driver unload:
> atitest at atitest:/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:00.0/0000:83:00.0$ ls xgmi_hive_info/
> xgmi_hive_id
>
> Thanks,
>
> Jon


I see the leak, but how is it related to amdgpu_reset_domain ? How you 
think that he causing this ?

Andrey


>
>
> Driver load (get ref happens on both device add to hive and init per device):
> [   61.975900] amdkcl: loading out-of-tree module taints kernel.
> [   61.975973] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
> [   62.065546] amdkcl: Warning: fail to get symbol cancel_work, replace it with kcl stub
> [   62.081920] AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.
> [   62.491119] [drm] amdgpu kernel modesetting enabled.
> [   62.491122] [drm] amdgpu version: 5.18.2
> [   62.491124] [drm] OS DRM version: 5.15.0
> [   62.491337] amdgpu: CRAT table not found
> [   62.491341] amdgpu: Virtual CRAT table created for CPU
> [   62.491360] amdgpu: Topology: Add CPU node
> [   62.603556] amdgpu: PeerDirect support was initialized successfully
> [   62.603847] amdgpu 0000:83:00.0: enabling device (0100 -> 0102)
> [   62.603987] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00).
> [   62.604023] [drm] register mmio base: 0xFBD00000
> [   62.604026] [drm] register mmio size: 524288
> [   62.604171] [drm] add ip block number 0 <soc15_common>
> [   62.604175] [drm] add ip block number 1 <gmc_v9_0>
> [   62.604177] [drm] add ip block number 2 <vega20_ih>
> [   62.604180] [drm] add ip block number 3 <psp>
> [   62.604182] [drm] add ip block number 4 <powerplay>
> [   62.604185] [drm] add ip block number 5 <dm>
> [   62.604187] [drm] add ip block number 6 <gfx_v9_0>
> [   62.604190] [drm] add ip block number 7 <sdma_v4_0>
> [   62.604192] [drm] add ip block number 8 <uvd_v7_0>
> [   62.604194] [drm] add ip block number 9 <vce_v4_0>
> [   62.641771] amdgpu 0000:83:00.0: amdgpu: Fetched VBIOS from ROM BAR
> [   62.641777] amdgpu: ATOM BIOS: 113-D1630200-112
> [   62.713418] [drm] UVD(0) is enabled in VM mode
> [   62.713423] [drm] UVD(1) is enabled in VM mode
> [   62.713426] [drm] UVD(0) ENC is enabled in VM mode
> [   62.713428] [drm] UVD(1) ENC is enabled in VM mode
> [   62.713430] [drm] VCE enabled in VM mode
> [   62.713433] amdgpu 0000:83:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
> [   62.713472] [drm] GPU posting now...
> [   62.713993] amdgpu 0000:83:00.0: amdgpu: MEM ECC is active.
> [   62.713995] amdgpu 0000:83:00.0: amdgpu: SRAM ECC is active.
> [   62.714006] amdgpu 0000:83:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
> [   62.714018] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
> [   62.714026] amdgpu 0000:83:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
> [   62.714029] amdgpu 0000:83:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
> [   62.714032] amdgpu 0000:83:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF
> [   62.714043] [drm] Detected VRAM RAM=32752M, BAR=32768M
> [   62.714044] [drm] RAM width 4096bits HBM
> [   62.714050] debugfs: Directory 'ttm' with parent '/' already present!
> [   62.714146] [drm] amdgpu: 32752M of VRAM memory ready
> [   62.714149] [drm] amdgpu: 40203M of GTT memory ready.
> [   62.714170] [drm] GART: num cpu pages 131072, num gpu pages 131072
> [   62.714266] [drm] PCIE GART of 512M enabled.
> [   62.714267] [drm] PTB located at 0x0000008000000000
> [   62.731067] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist
> [   62.731075] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist
> [   62.731449] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
> [   62.743177] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
> [   62.743244] [drm] PSP loading UVD firmware
> [   62.744525] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
> [   62.744689] [drm] PSP loading VCE firmware
> [   62.896804] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
> [   62.979421] amdgpu 0000:83:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
> [   62.979427] amdgpu 0000:83:00.0: amdgpu: DTM: optional dtm ta ucode is not available
> [   62.979430] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta ucode is not available
> [   62.979432] amdgpu 0000:83:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
> [   62.982386] [drm] Display Core initialized with v3.2.196!
> [   62.984514] [drm] kiq ring mec 2 pipe 1 q 0
> [   63.026846] [drm] UVD and UVD ENC initialized successfully.
> [   63.225760] [drm] VCE initialized successfully.
> [   63.244442] amdgpu: [dbg_xgmi_hive_get] ref_count 2
> [   63.244448] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   63.244454] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   63.244457] Workqueue: events work_for_cpu_fn
> [   63.244471] Call Trace:
> [   63.244474]  <TASK>
> [   63.244479]  dump_stack_lvl+0x4a/0x63
> [   63.244493]  dump_stack+0x10/0x16
> [   63.244501]  amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
> [   63.245047]  amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
> [   63.245463]  ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
> [   63.245879]  ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
> [   63.246466]  amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
> [   63.247055]  ? pci_bus_read_config_word+0x4a/0x70
> [   63.247064]  ? do_pci_enable_device+0xdb/0x110
> [   63.247070]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   63.247463]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   63.247868]  local_pci_probe+0x4b/0x90
> [   63.247876]  work_for_cpu_fn+0x1a/0x30
> [   63.247881]  process_one_work+0x22b/0x3d0
> [   63.247887]  worker_thread+0x21d/0x3f0
> [   63.247893]  ? process_one_work+0x3d0/0x3d0
> [   63.247898]  kthread+0x12a/0x150
> [   63.247905]  ? set_kthread_struct+0x50/0x50
> [   63.247910]  ret_from_fork+0x22/0x30
> [   63.247922]  </TASK>
> [   63.248563] amdgpu 0000:83:00.0: amdgpu: XGMI: Add node 0, hive 0x25bbae7e3fd04cf4.
> [   63.248569] amdgpu: [dbg_xgmi_hive_get] ref_count 3
> [   63.248572] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   63.248578] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   63.248580] Workqueue: events work_for_cpu_fn
> [   63.248587] Call Trace:
> [   63.248588]  <TASK>
> [   63.248590]  dump_stack_lvl+0x4a/0x63
> [   63.248598]  dump_stack+0x10/0x16
> [   63.248604]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> [   63.249033]  amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
> [   63.249621]  ? pci_bus_read_config_word+0x4a/0x70
> [   63.249627]  ? do_pci_enable_device+0xdb/0x110
> [   63.249632]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   63.250022]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   63.250410]  local_pci_probe+0x4b/0x90
> [   63.250416]  work_for_cpu_fn+0x1a/0x30
> [   63.250421]  process_one_work+0x22b/0x3d0
> [   63.250428]  worker_thread+0x21d/0x3f0
> [   63.250434]  ? process_one_work+0x3d0/0x3d0
> [   63.250440]  kthread+0x12a/0x150
> [   63.250445]  ? set_kthread_struct+0x50/0x50
> [   63.250450]  ret_from_fork+0x22/0x30
> [   63.250458]  </TASK>
> [   63.268869] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> [   63.269180] amdgpu: sdma_bitmap: ffff
> [   63.605188] memmap_init_zone_device initialised 8388608 pages in 132ms
> [   63.605203] amdgpu: HMM registered 32752MB device memory
> [   63.605244] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>
> [   63.605263] amdgpu: Virtual CRAT table created for GPU
> [   63.605651] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>
> [   63.605659] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
> [   63.605670] kfd kfd: amdgpu: added device 1002:66a1
> [   63.626300] amdgpu 0000:83:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
> [   63.626517] amdgpu 0000:83:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
> [   63.626522] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
> [   63.626525] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
> [   63.626529] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
> [   63.626531] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
> [   63.626534] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
> [   63.626537] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
> [   63.626540] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
> [   63.626543] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
> [   63.626546] amdgpu 0000:83:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
> [   63.626549] amdgpu 0000:83:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
> [   63.626552] amdgpu 0000:83:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
> [   63.626555] amdgpu 0000:83:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
> [   63.626558] amdgpu 0000:83:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
> [   63.626561] amdgpu 0000:83:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
> [   63.626563] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
> [   63.626566] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
> [   63.626569] amdgpu 0000:83:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
> [   63.626572] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
> [   63.626575] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
> [   63.626577] amdgpu 0000:83:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
> [   63.626580] amdgpu 0000:83:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
> [   63.626583] amdgpu 0000:83:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
> [   63.636996] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
> [   63.637046] amdgpu: Detected AMDGPU 2 Perf Events.
> [   63.637428] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:83:00.0 on minor 1
> [   63.637937] amdgpu 0000:86:00.0: enabling device (0100 -> 0102)
> [   63.638043] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00).
> [   63.638090] [drm] register mmio base: 0xFBB00000
> [   63.638092] [drm] register mmio size: 524288
> [   63.638261] [drm] add ip block number 0 <soc15_common>
> [   63.638263] [drm] add ip block number 1 <gmc_v9_0>
> [   63.638265] [drm] add ip block number 2 <vega20_ih>
> [   63.638266] [drm] add ip block number 3 <psp>
> [   63.638267] [drm] add ip block number 4 <powerplay>
> [   63.638269] [drm] add ip block number 5 <dm>
> [   63.638271] [drm] add ip block number 6 <gfx_v9_0>
> [   63.638272] [drm] add ip block number 7 <sdma_v4_0>
> [   63.638273] [drm] add ip block number 8 <uvd_v7_0>
> [   63.638275] [drm] add ip block number 9 <vce_v4_0>
> [   63.675838] amdgpu 0000:86:00.0: amdgpu: Fetched VBIOS from ROM BAR
> [   63.675842] amdgpu: ATOM BIOS: 113-D1630200-112
> [   63.675867] [drm] UVD(0) is enabled in VM mode
> [   63.675868] [drm] UVD(1) is enabled in VM mode
> [   63.675869] [drm] UVD(0) ENC is enabled in VM mode
> [   63.675870] [drm] UVD(1) ENC is enabled in VM mode
> [   63.675871] [drm] VCE enabled in VM mode
> [   63.675873] amdgpu 0000:86:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
> [   63.675899] [drm] GPU posting now...
> [   63.676276] amdgpu 0000:86:00.0: amdgpu: MEM ECC is active.
> [   63.676277] amdgpu 0000:86:00.0: amdgpu: SRAM ECC is active.
> [   63.676286] amdgpu 0000:86:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
> [   63.676297] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
> [   63.676304] amdgpu 0000:86:00.0: amdgpu: VRAM: 32752M 0x0000008800000000 - 0x0000008FFEFFFFFF (32752M used)
> [   63.676307] amdgpu 0000:86:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
> [   63.676310] amdgpu 0000:86:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF
> [   63.676321] [drm] Detected VRAM RAM=32752M, BAR=32768M
> [   63.676322] [drm] RAM width 4096bits HBM
> [   63.676363] [drm] amdgpu: 32752M of VRAM memory ready
> [   63.676365] [drm] amdgpu: 40203M of GTT memory ready.
> [   63.676388] [drm] GART: num cpu pages 131072, num gpu pages 131072
> [   63.676481] [drm] PCIE GART of 512M enabled.
> [   63.676482] [drm] PTB located at 0x0000008800000000
> [   63.676730] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist
> [   63.676733] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist
> [   63.677088] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
> [   63.678862] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
> [   63.678918] [drm] PSP loading UVD firmware
> [   63.679487] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
> [   63.679619] [drm] PSP loading VCE firmware
> [   63.831730] [drm] reserve 0x400000 from 0x8ffec00000 for PSP TMR
> [   63.914508] amdgpu 0000:86:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
> [   63.914513] amdgpu 0000:86:00.0: amdgpu: DTM: optional dtm ta ucode is not available
> [   63.914516] amdgpu 0000:86:00.0: amdgpu: RAP: optional rap ta ucode is not available
> [   63.914518] amdgpu 0000:86:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
> [   63.917458] [drm] Display Core initialized with v3.2.196!
> [   63.919616] [drm] kiq ring mec 2 pipe 1 q 0
> [   63.961950] [drm] UVD and UVD ENC initialized successfully.
> [   64.160863] [drm] VCE initialized successfully.
> [   64.179285] amdgpu: [dbg_xgmi_hive_get] ref_count 4
> [   64.179291] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   64.179297] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   64.179299] Workqueue: events work_for_cpu_fn
> [   64.179311] Call Trace:
> [   64.179315]  <TASK>
> [   64.179320]  dump_stack_lvl+0x4a/0x63
> [   64.179331]  dump_stack+0x10/0x16
> [   64.179340]  amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
> [   64.179904]  amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
> [   64.180318]  ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
> [   64.180733]  ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
> [   64.181321]  amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
> [   64.181909]  ? pci_bus_read_config_word+0x4a/0x70
> [   64.181917]  ? do_pci_enable_device+0xdb/0x110
> [   64.181923]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   64.182315]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   64.182703]  local_pci_probe+0x4b/0x90
> [   64.182710]  work_for_cpu_fn+0x1a/0x30
> [   64.182715]  process_one_work+0x22b/0x3d0
> [   64.182722]  worker_thread+0x21d/0x3f0
> [   64.182728]  ? process_one_work+0x3d0/0x3d0
> [   64.182734]  kthread+0x12a/0x150
> [   64.182740]  ? set_kthread_struct+0x50/0x50
> [   64.182745]  ret_from_fork+0x22/0x30
> [   64.182756]  </TASK>
> [   64.184561] amdgpu 0000:86:00.0: amdgpu: XGMI: Add node 1, hive 0x25bbae7e3fd04cf4.
> [   64.184568] amdgpu: [dbg_xgmi_hive_get] ref_count 5
> [   64.184571] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   64.184576] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   64.184578] Workqueue: events work_for_cpu_fn
> [   64.184585] Call Trace:
> [   64.184587]  <TASK>
> [   64.184589]  dump_stack_lvl+0x4a/0x63
> [   64.184596]  dump_stack+0x10/0x16
> [   64.184602]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> [   64.185041]  amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
> [   64.185624]  ? pci_bus_read_config_word+0x4a/0x70
> [   64.185631]  ? do_pci_enable_device+0xdb/0x110
> [   64.185636]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   64.186027]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   64.186416]  local_pci_probe+0x4b/0x90
> [   64.186422]  work_for_cpu_fn+0x1a/0x30
> [   64.186428]  process_one_work+0x22b/0x3d0
> [   64.186434]  worker_thread+0x21d/0x3f0
> [   64.186439]  ? process_one_work+0x3d0/0x3d0
> [   64.186445]  kthread+0x12a/0x150
> [   64.186450]  ? set_kthread_struct+0x50/0x50
> [   64.186455]  ret_from_fork+0x22/0x30
> [   64.186464]  </TASK>
> [   64.206119] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> [   64.206433] amdgpu: sdma_bitmap: ffff
> [   64.552064] memmap_init_zone_device initialised 8388608 pages in 132ms
> [   64.552080] amdgpu: HMM registered 32752MB device memory
> [   64.552116] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>
> [   64.552138] amdgpu: Virtual CRAT table created for GPU
> [   64.552978] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
>
> [   64.552988] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
> [   64.552999] kfd kfd: amdgpu: added device 1002:66a1
> [   64.570314] amdgpu 0000:86:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
> [   64.570527] amdgpu 0000:86:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
> [   64.570531] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
> [   64.570535] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
> [   64.570538] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
> [   64.570541] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
> [   64.570544] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
> [   64.570547] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
> [   64.570550] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
> [   64.570552] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
> [   64.570556] amdgpu 0000:86:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
> [   64.570559] amdgpu 0000:86:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
> [   64.570562] amdgpu 0000:86:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
> [   64.570565] amdgpu 0000:86:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
> [   64.570567] amdgpu 0000:86:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
> [   64.570570] amdgpu 0000:86:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
> [   64.570573] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
> [   64.570576] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
> [   64.570579] amdgpu 0000:86:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
> [   64.570581] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
> [   64.570584] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
> [   64.570587] amdgpu 0000:86:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
> [   64.570589] amdgpu 0000:86:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
> [   64.570592] amdgpu 0000:86:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
> [   64.581070] amdgpu: [dbg_xgmi_hive_get] ref_count 6
> [   64.581075] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   64.581079] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   64.581081] Workqueue: events work_for_cpu_fn
> [   64.581089] Call Trace:
> [   64.581091]  <TASK>
> [   64.581094]  dump_stack_lvl+0x4a/0x63
> [   64.581103]  dump_stack+0x10/0x16
> [   64.581109]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> [   64.581489]  amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
> [   64.581723]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> [   64.581943]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> [   64.582288]  ? pci_bus_read_config_word+0x4a/0x70
> [   64.582295]  ? do_pci_enable_device+0xdb/0x110
> [   64.582298]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   64.582520]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   64.582738]  local_pci_probe+0x4b/0x90
> [   64.582743]  work_for_cpu_fn+0x1a/0x30
> [   64.582746]  process_one_work+0x22b/0x3d0
> [   64.582750]  worker_thread+0x21d/0x3f0
> [   64.582753]  ? process_one_work+0x3d0/0x3d0
> [   64.582756]  kthread+0x12a/0x150
> [   64.582761]  ? set_kthread_struct+0x50/0x50
> [   64.582764]  ret_from_fork+0x22/0x30
> [   64.582772]  </TASK>
> [   64.582774] amdgpu: [dbg_xgmi_hive_put] ref_count 5
> [   64.582775] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   64.582778] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   64.582779] Workqueue: events work_for_cpu_fn
> [   64.582782] Call Trace:
> [   64.582783]  <TASK>
> [   64.582784]  dump_stack_lvl+0x4a/0x63
> [   64.582789]  dump_stack+0x10/0x16
> [   64.582792]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> [   64.583028]  amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
> [   64.583262]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> [   64.583482]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> [   64.583833]  ? pci_bus_read_config_word+0x4a/0x70
> [   64.583836]  ? do_pci_enable_device+0xdb/0x110
> [   64.583840]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   64.584072]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   64.584304]  local_pci_probe+0x4b/0x90
> [   64.584307]  work_for_cpu_fn+0x1a/0x30
> [   64.584311]  process_one_work+0x22b/0x3d0
> [   64.584314]  worker_thread+0x21d/0x3f0
> [   64.584318]  ? process_one_work+0x3d0/0x3d0
> [   64.584321]  kthread+0x12a/0x150
> [   64.584324]  ? set_kthread_struct+0x50/0x50
> [   64.584327]  ret_from_fork+0x22/0x30
> [   64.584333]  </TASK>
> [   64.584342] amdgpu: [dbg_xgmi_hive_get] ref_count 6
> [   64.584344] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   64.584347] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   64.584348] Workqueue: events work_for_cpu_fn
> [   64.584352] Call Trace:
> [   64.584353]  <TASK>
> [   64.584354]  dump_stack_lvl+0x4a/0x63
> [   64.584358]  dump_stack+0x10/0x16
> [   64.584362]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> [   64.584610]  amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
> [   64.584856]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> [   64.585086]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> [   64.585437]  ? pci_bus_read_config_word+0x4a/0x70
> [   64.585440]  ? do_pci_enable_device+0xdb/0x110
> [   64.585443]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   64.585679]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   64.585922]  local_pci_probe+0x4b/0x90
> [   64.585926]  work_for_cpu_fn+0x1a/0x30
> [   64.585929]  process_one_work+0x22b/0x3d0
> [   64.585932]  worker_thread+0x21d/0x3f0
> [   64.585936]  ? process_one_work+0x3d0/0x3d0
> [   64.585939]  kthread+0x12a/0x150
> [   64.585942]  ? set_kthread_struct+0x50/0x50
> [   64.585945]  ret_from_fork+0x22/0x30
> [   64.585950]  </TASK>
> [   64.585951] amdgpu: [dbg_xgmi_hive_put] ref_count 5
> [   64.585953] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [   64.585956] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [   64.585957] Workqueue: events work_for_cpu_fn
> [   64.585960] Call Trace:
> [   64.585961]  <TASK>
> [   64.585963]  dump_stack_lvl+0x4a/0x63
> [   64.585967]  dump_stack+0x10/0x16
> [   64.585970]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> [   64.586213]  amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
> [   64.586458]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> [   64.586688]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> [   64.587037]  ? pci_bus_read_config_word+0x4a/0x70
> [   64.587040]  ? do_pci_enable_device+0xdb/0x110
> [   64.587043]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> [   64.587277]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> [   64.587509]  local_pci_probe+0x4b/0x90
> [   64.587512]  work_for_cpu_fn+0x1a/0x30
> [   64.587515]  process_one_work+0x22b/0x3d0
> [   64.587519]  worker_thread+0x21d/0x3f0
> [   64.587523]  ? process_one_work+0x3d0/0x3d0
> [   64.587526]  kthread+0x12a/0x150
> [   64.587529]  ? set_kthread_struct+0x50/0x50
> [   64.587532]  ret_from_fork+0x22/0x30
> [   64.587537]  </TASK>
> [   64.587619] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
> [   64.587663] amdgpu: Detected AMDGPU 2 Perf Events.
> [   64.588081] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:86:00.0 on minor 2
>
> Then driver unload (reference stuck at 2):
> [  110.117018] amdgpu 0000:86:00.0: amdgpu: amdgpu: finishing device.
> [  110.131638] [drm] free PSP TMR buffer
> [  110.420529] amdgpu: [dbg_xgmi_hive_put] ref_count 4
> [  110.420537] CPU: 27 PID: 1748 Comm: modprobe Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [  110.420545] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [  110.420548] Call Trace:
> [  110.420551]  <TASK>
> [  110.420556]  dump_stack_lvl+0x4a/0x63
> [  110.420569]  dump_stack+0x10/0x16
> [  110.420578]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> [  110.421001]  amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
> [  110.421380]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
> [  110.421724]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> [  110.422070]  drm_dev_release+0x28/0x50 [drm]
> [  110.422145]  devm_drm_dev_init_release+0x38/0x60 [drm]
> [  110.422190]  devm_action_release+0x15/0x20
> [  110.422198]  release_nodes+0x40/0xb0
> [  110.422205]  devres_release_all+0x9e/0xe0
> [  110.422212]  device_release_driver_internal+0x117/0x1f0
> [  110.422218]  driver_detach+0x4c/0xa0
> [  110.422222]  bus_remove_driver+0x6c/0xf0
> [  110.422227]  driver_unregister+0x31/0x50
> [  110.422231]  pci_unregister_driver+0x40/0x90
> [  110.422238]  amdgpu_exit+0x15/0x446 [amdgpu]
> [  110.422791]  __x64_sys_delete_module+0x14e/0x260
> [  110.422801]  ? do_syscall_64+0x69/0xc0
> [  110.422809]  ? __x64_sys_read+0x1a/0x20
> [  110.422817]  ? do_syscall_64+0x69/0xc0
> [  110.422821]  ? ksys_read+0x67/0xf0
> [  110.422825]  do_syscall_64+0x5c/0xc0
> [  110.422830]  ? __x64_sys_read+0x1a/0x20
> [  110.422834]  ? do_syscall_64+0x69/0xc0
> [  110.422839]  ? syscall_exit_to_user_mode+0x27/0x50
> [  110.422846]  ? __x64_sys_openat+0x20/0x30
> [  110.422853]  ? do_syscall_64+0x69/0xc0
> [  110.422857]  ? do_syscall_64+0x69/0xc0
> [  110.422862]  ? irqentry_exit+0x1d/0x30
> [  110.422868]  ? exc_page_fault+0x89/0x170
> [  110.422874]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> [  110.422885] RIP: 0033:0x7f1576682a6b
> [  110.422892] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
> [  110.422897] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [  110.422904] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
> [  110.422908] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
> [  110.422911] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
> [  110.422913] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
> [  110.422916] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
> [  110.422921]  </TASK>
> [  110.425941] [drm] amdgpu: ttm finalized
> [  110.489186] amdgpu 0000:83:00.0: amdgpu: amdgpu: finishing device.
> [  110.504025] [drm] free PSP TMR buffer
> [  110.762272] amdgpu: [dbg_xgmi_hive_put] ref_count 3
> [  110.762280] CPU: 27 PID: 1748 Comm: modprobe Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [  110.762288] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [  110.762290] Call Trace:
> [  110.762294]  <TASK>
> [  110.762298]  dump_stack_lvl+0x4a/0x63
> [  110.762313]  dump_stack+0x10/0x16
> [  110.762319]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> [  110.762663]  amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
> [  110.762965]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
> [  110.763231]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> [  110.763519]  drm_dev_release+0x28/0x50 [drm]
> [  110.763569]  devm_drm_dev_init_release+0x38/0x60 [drm]
> [  110.763609]  devm_action_release+0x15/0x20
> [  110.763617]  release_nodes+0x40/0xb0
> [  110.763624]  devres_release_all+0x9e/0xe0
> [  110.763631]  device_release_driver_internal+0x117/0x1f0
> [  110.763636]  driver_detach+0x4c/0xa0
> [  110.763640]  bus_remove_driver+0x6c/0xf0
> [  110.763646]  driver_unregister+0x31/0x50
> [  110.763650]  pci_unregister_driver+0x40/0x90
> [  110.763657]  amdgpu_exit+0x15/0x446 [amdgpu]
> [  110.764153]  __x64_sys_delete_module+0x14e/0x260
> [  110.764164]  ? do_syscall_64+0x69/0xc0
> [  110.764172]  ? __x64_sys_read+0x1a/0x20
> [  110.764180]  ? do_syscall_64+0x69/0xc0
> [  110.764184]  ? ksys_read+0x67/0xf0
> [  110.764189]  do_syscall_64+0x5c/0xc0
> [  110.764193]  ? __x64_sys_read+0x1a/0x20
> [  110.764197]  ? do_syscall_64+0x69/0xc0
> [  110.764202]  ? syscall_exit_to_user_mode+0x27/0x50
> [  110.764209]  ? __x64_sys_openat+0x20/0x30
> [  110.764217]  ? do_syscall_64+0x69/0xc0
> [  110.764221]  ? do_syscall_64+0x69/0xc0
> [  110.764226]  ? irqentry_exit+0x1d/0x30
> [  110.764232]  ? exc_page_fault+0x89/0x170
> [  110.764238]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> [  110.764248] RIP: 0033:0x7f1576682a6b
> [  110.764255] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
> [  110.764260] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [  110.764267] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
> [  110.764270] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
> [  110.764273] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
> [  110.764275] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
> [  110.764278] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
> [  110.764283]  </TASK>
> [  110.764326] amdgpu: [dbg_xgmi_hive_put] ref_count 2
> [  110.764329] CPU: 27 PID: 1748 Comm: modprobe Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> [  110.764334] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
> [  110.764336] Call Trace:
> [  110.764337]  <TASK>
> [  110.764339]  dump_stack_lvl+0x4a/0x63
> [  110.764347]  dump_stack+0x10/0x16
> [  110.764354]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> [  110.764624]  amdgpu_xgmi_remove_device+0x1ad/0x1c0 [amdgpu]
> [  110.764791]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
> [  110.764937]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> [  110.765085]  drm_dev_release+0x28/0x50 [drm]
> [  110.765108]  devm_drm_dev_init_release+0x38/0x60 [drm]
> [  110.765130]  devm_action_release+0x15/0x20
> [  110.765134]  release_nodes+0x40/0xb0
> [  110.765137]  devres_release_all+0x9e/0xe0
> [  110.765141]  device_release_driver_internal+0x117/0x1f0
> [  110.765144]  driver_detach+0x4c/0xa0
> [  110.765146]  bus_remove_driver+0x6c/0xf0
> [  110.765148]  driver_unregister+0x31/0x50
> [  110.765150]  pci_unregister_driver+0x40/0x90
> [  110.765154]  amdgpu_exit+0x15/0x446 [amdgpu]
> [  110.765434]  __x64_sys_delete_module+0x14e/0x260
> [  110.765438]  ? do_syscall_64+0x69/0xc0
> [  110.765441]  ? __x64_sys_read+0x1a/0x20
> [  110.765444]  ? do_syscall_64+0x69/0xc0
> [  110.765446]  ? ksys_read+0x67/0xf0
> [  110.765449]  do_syscall_64+0x5c/0xc0
> [  110.765451]  ? __x64_sys_read+0x1a/0x20
> [  110.765454]  ? do_syscall_64+0x69/0xc0
> [  110.765456]  ? syscall_exit_to_user_mode+0x27/0x50
> [  110.765460]  ? __x64_sys_openat+0x20/0x30
> [  110.765464]  ? do_syscall_64+0x69/0xc0
> [  110.765466]  ? do_syscall_64+0x69/0xc0
> [  110.765469]  ? irqentry_exit+0x1d/0x30
> [  110.765472]  ? exc_page_fault+0x89/0x170
> [  110.765476]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> [  110.765480] RIP: 0033:0x7f1576682a6b
> [  110.765482] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
> [  110.765485] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> [  110.765488] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
> [  110.765489] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
> [  110.765491] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
> [  110.765492] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
> [  110.765494] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
> [  110.765496]  </TASK>
> [  110.768091] [drm] amdgpu: ttm finalized
>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>> Sent: August 11, 2022 12:43 PM
>> To: Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix
>> <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
>> leak
>>
>>
>> On 2022-08-11 11:34, Kim, Jonathan wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
>>>> Sent: August 11, 2022 11:19 AM
>>>> To: amd-gfx at lists.freedesktop.org; Kim, Jonathan
>> <Jonathan.Kim at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
>>>> leak
>>>>
>>>> Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
>>>>> When an xgmi node is added to the hive, it takes another hive
>>>>> reference for its reset domain.
>>>>>
>>>>> This extra reference was not dropped on device removal from the
>>>>> hive so drop it.
>>>>>
>>>>> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
>>>>>     1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>>>> index 1b108d03e785..560bf1c98f08 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>>>>> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct
>>>> amdgpu_device *adev)
>>>>>       mutex_unlock(&hive->hive_lock);
>>>>>
>>>>>       amdgpu_put_xgmi_hive(hive);
>>>>> +   /* device is removed from the hive so remove its reset domain
>>>> reference */
>>>>> +   if (adev->reset_domain && adev->reset_domain == hive-
>>>>> reset_domain)
>>>>> +           amdgpu_put_xgmi_hive(hive);
>>>> This is some messed up reference counting. If you need an extra
>>>> reference from the reset_domain to the hive, that should be owned by the
>>>> reset_domain and dropped when the reset_domain is destroyed. And it's
>>>> only one reference for the reset_domain, not one reference per adev in
>>>> the reset_domain.
>>> Cc'ing Andrey.
>>>
>>> What you're saying seems to make more sense to me, but what I got from an
>> offline conversation with Andrey
>>> was that the reset domain reference per device was intentional.
>>> Maybe Andrey can comment here.
>>>
>>>> What you're doing here looks like every adev that's in a reset_domain of
>>>> its hive has two references to the hive. And if you're dropping the
>>>> extra reference here, it still leaves the reset_domain with a dangling
>>>> pointer to a hive that may no longer exist. So this extra reference is
>>>> kind of pointless.
>>
>> reset_domain doesn't have any references to the hive, the hive has a
>> reference to reset_domain
>>
>>
>>> Yes.  Currently one reference is fetched from the device's lifetime on the hive
>> and the other is from the
>>> per-device reset domain.
>>>
>>> Snippet from amdgpu_device_ip_init:
>>>           /**
>>>            * In case of XGMI grab extra reference for reset domain for this device
>>>            */
>>>           if (adev->gmc.xgmi.num_physical_nodes > 1) {
>>>                   if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is
>> fetched here
>>
>>
>> amdgpu_xgmi_add_device calls  amdgpu_get_xgmi_hive and only on the first
>> time amdgpu_get_xgmi_hive is called and hive is actually allocated and
>> initialized  will we proceed
>> to creating the reset domain either from scratch (first creation of the
>> hive) or by taking reference from adev (see [1])
>>
>>
>>
>> [1] -
>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/a
>> mdgpu_xgmi.c#L394
>>
>>>                           struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
>> <- [JK] then here again
>>
>>
>> So here I don't see how an extra reference to reset_domain is taken if
>> amdgpu_get_xgmi_hive returns early since the hive already created and
>> exists in the global hive container ?
>>
>> Johantan - can u please show the exact flow how recount leak on
>> reset_domain is happening ?
>>
>> Andrey
>>
>>
>>>                           if (!hive->reset_domain ||
>>>                               !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
>>>                                   r = -ENOENT;
>>>                                   goto init_failed;
>>>                           }
>>>
>>>                           /* Drop the early temporary reset domain we created for device
>> */
>>>                           amdgpu_reset_put_reset_domain(adev->reset_domain);
>>>                           adev->reset_domain = hive->reset_domain;
>>>                   }
>>>           }
>>>
>>> One of these never gets dropped so a leak happens.
>>> So either the extra reference has to be dropped on device removal from the
>> hive or from what you've mentioned,
>>> the reset_domain reference fetch should be fixed to grab at the
>> hive/reset_domain level.
>>> Thanks,
>>>
>>> Jon
>>>
>>>> Regards,
>>>>      Felix
>>>>
>>>>
>>>>>       adev->hive = NULL;
>>>>>
>>>>>       if (atomic_dec_return(&hive->number_devices) == 0) {


More information about the amd-gfx mailing list