[PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

Kim, Jonathan Jonathan.Kim at amd.com
Fri Aug 12 22:28:18 UTC 2022


[Public]

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> Sent: August 12, 2022 6:12 PM
> To: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Kim, Jonathan
> <Jonathan.Kim at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
> leak
>
>
> On 2022-08-12 18:05, Andrey Grodzovsky wrote:
> >
> > On 2022-08-12 14:38, Kim, Jonathan wrote:
> >> [Public]
> >>
> >> Hi Andrey,
> >>
> >> Here's the load/unload stack trace.  This is a 2 GPU xGMI system.  I
> >> added dbg_xgmi_hive_get/put prints that log the refcount right after
> >> each kobj get/put.  It's stuck at 2 on unload.  On an 8 GPU system,
> >> it's stuck at 8.
> >>
> >> Example of the sysfs leak after driver unload:
> >>
> >> atitest at atitest:/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:00.0/0000:83:00.0$ ls xgmi_hive_info/
> >> xgmi_hive_id
> >>
> >> Thanks,
> >>
> >> Jon
> >
> >
> > I see the leak, but how is it related to amdgpu_reset_domain? How do
> > you think it is causing this?
> Does YiPeng's patch "[PATCH 2/2] drm/amdgpu: fix hive reference leak
> when adding xgmi device" address the same issue?

Yes, this is the extra reference I was talking about in the snippet I posted.

Thanks,

Jon
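
For anyone following the counts in the quoted trace, here is a rough userspace model of the get/put sequence, written in Python rather than driver code. It only replays what the debug prints show: one get from amdgpu_xgmi_add_device and a second get from amdgpu_device_init per device, balanced get/put pairs from amdgpu_xgmi_set_pstate, one put per device from amdgpu_xgmi_remove_device, and one more put when the last device goes away. The starting count of 1, the number of set_pstate pairs, and the drop_init_ref switch are assumptions made for illustration; the switch is hypothetical and does not describe either posted patch.

```python
def simulate_hive_refcount(num_gpus, drop_init_ref=False):
    """Replay the get/put pattern from the trace for a hive of num_gpus devices.

    Returns (final_refcount, trace), where trace lists the count printed
    after each get/put, like the dbg_xgmi_hive_get/put lines in the log.
    """
    trace = []
    refcount = 1  # assumption: the hive kobject starts at 1 when created

    def get():
        nonlocal refcount
        refcount += 1
        trace.append(("get", refcount))

    def put():
        nonlocal refcount
        refcount -= 1
        trace.append(("put", refcount))

    # Driver load: two gets per device.
    for _ in range(num_gpus):
        get()  # from amdgpu_xgmi_add_device
        get()  # second lookup directly from amdgpu_device_init

    # Balanced pairs from amdgpu_xgmi_set_pstate (assumed one per device).
    for _ in range(num_gpus):
        get()
        put()

    # Driver unload: one put per device from amdgpu_xgmi_remove_device.
    for _ in range(num_gpus):
        put()
        if drop_init_ref:
            put()  # hypothetical: also release the reference taken at init

    put()  # extra put seen when the last device leaves the hive
    return refcount, trace


if __name__ == "__main__":
    for gpus in (2, 8):
        leaked, _ = simulate_hive_refcount(gpus)
        print(f"{gpus} GPUs: refcount after unload = {leaked}")
```

Under these assumptions the model ends at one leftover reference per device, which lines up with the count being stuck at 2 here and at 8 on the 8 GPU system. With the hypothetical drop_init_ref=True the count reaches 0.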

>
> Regards,
>    Felix
>
>
> >
> > Andrey
> >
> >
> >>
> >>
> >> Driver load (a reference is taken both when a device is added to the
> >> hive and again during per-device init):
> >> [   61.975900] amdkcl: loading out-of-tree module taints kernel.
> >> [   61.975973] amdkcl: module verification failed: signature and/or
> >> required key missing - tainting kernel
> >> [   62.065546] amdkcl: Warning: fail to get symbol cancel_work,
> >> replace it with kcl stub
> >> [   62.081920] AMD-Vi: AMD IOMMUv2 functionality not available on
> >> this system - This is not a bug.
> >> [   62.491119] [drm] amdgpu kernel modesetting enabled.
> >> [   62.491122] [drm] amdgpu version: 5.18.2
> >> [   62.491124] [drm] OS DRM version: 5.15.0
> >> [   62.491337] amdgpu: CRAT table not found
> >> [   62.491341] amdgpu: Virtual CRAT table created for CPU
> >> [   62.491360] amdgpu: Topology: Add CPU node
> >> [   62.603556] amdgpu: PeerDirect support was initialized successfully
> >> [   62.603847] amdgpu 0000:83:00.0: enabling device (0100 -> 0102)
> >> [   62.603987] [drm] initializing kernel modesetting (VEGA20
> >> 0x1002:0x66A1 0x1002:0x0834 0x00).
> >> [   62.604023] [drm] register mmio base: 0xFBD00000
> >> [   62.604026] [drm] register mmio size: 524288
> >> [   62.604171] [drm] add ip block number 0 <soc15_common>
> >> [   62.604175] [drm] add ip block number 1 <gmc_v9_0>
> >> [   62.604177] [drm] add ip block number 2 <vega20_ih>
> >> [   62.604180] [drm] add ip block number 3 <psp>
> >> [   62.604182] [drm] add ip block number 4 <powerplay>
> >> [   62.604185] [drm] add ip block number 5 <dm>
> >> [   62.604187] [drm] add ip block number 6 <gfx_v9_0>
> >> [   62.604190] [drm] add ip block number 7 <sdma_v4_0>
> >> [   62.604192] [drm] add ip block number 8 <uvd_v7_0>
> >> [   62.604194] [drm] add ip block number 9 <vce_v4_0>
> >> [   62.641771] amdgpu 0000:83:00.0: amdgpu: Fetched VBIOS from ROM BAR
> >> [   62.641777] amdgpu: ATOM BIOS: 113-D1630200-112
> >> [   62.713418] [drm] UVD(0) is enabled in VM mode
> >> [   62.713423] [drm] UVD(1) is enabled in VM mode
> >> [   62.713426] [drm] UVD(0) ENC is enabled in VM mode
> >> [   62.713428] [drm] UVD(1) ENC is enabled in VM mode
> >> [   62.713430] [drm] VCE enabled in VM mode
> >> [   62.713433] amdgpu 0000:83:00.0: amdgpu: Trusted Memory Zone (TMZ)
> >> feature not supported
> >> [   62.713472] [drm] GPU posting now...
> >> [   62.713993] amdgpu 0000:83:00.0: amdgpu: MEM ECC is active.
> >> [   62.713995] amdgpu 0000:83:00.0: amdgpu: SRAM ECC is active.
> >> [   62.714006] amdgpu 0000:83:00.0: amdgpu: RAS INFO: ras initialized
> >> successfully, hardware ability[7fff] ras_mask[7fff]
> >> [   62.714018] [drm] vm size is 262144 GB, 4 levels, block size is
> >> 9-bit, fragment size is 9-bit
> >> [   62.714026] amdgpu 0000:83:00.0: amdgpu: VRAM: 32752M
> >> 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
> >> [   62.714029] amdgpu 0000:83:00.0: amdgpu: GART: 512M
> >> 0x0000000000000000 - 0x000000001FFFFFFF
> >> [   62.714032] amdgpu 0000:83:00.0: amdgpu: AGP: 267845632M
> >> 0x0000009000000000 - 0x0000FFFFFFFFFFFF
> >> [   62.714043] [drm] Detected VRAM RAM=32752M, BAR=32768M
> >> [   62.714044] [drm] RAM width 4096bits HBM
> >> [   62.714050] debugfs: Directory 'ttm' with parent '/' already present!
> >> [   62.714146] [drm] amdgpu: 32752M of VRAM memory ready
> >> [   62.714149] [drm] amdgpu: 40203M of GTT memory ready.
> >> [   62.714170] [drm] GART: num cpu pages 131072, num gpu pages 131072
> >> [   62.714266] [drm] PCIE GART of 512M enabled.
> >> [   62.714267] [drm] PTB located at 0x0000008000000000
> >> [   62.731067] amdgpu 0000:83:00.0: amdgpu: PSP runtime database
> >> doesn't exist
> >> [   62.731075] amdgpu 0000:83:00.0: amdgpu: PSP runtime database
> >> doesn't exist
> >> [   62.731449] amdgpu: [powerplay] hwmgr_sw_init smu backed is
> >> vega20_smu
> >> [   62.743177] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
> >> [   62.743244] [drm] PSP loading UVD firmware
> >> [   62.744525] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
> >> [   62.744689] [drm] PSP loading VCE firmware
> >> [   62.896804] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
> >> [   62.979421] amdgpu 0000:83:00.0: amdgpu: HDCP: optional hdcp ta
> >> ucode is not available
> >> [   62.979427] amdgpu 0000:83:00.0: amdgpu: DTM: optional dtm ta
> >> ucode is not available
> >> [   62.979430] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta
> >> ucode is not available
> >> [   62.979432] amdgpu 0000:83:00.0: amdgpu: SECUREDISPLAY:
> >> securedisplay ta ucode is not available
> >> [   62.982386] [drm] Display Core initialized with v3.2.196!
> >> [   62.984514] [drm] kiq ring mec 2 pipe 1 q 0
> >> [   63.026846] [drm] UVD and UVD ENC initialized successfully.
> >> [   63.225760] [drm] VCE initialized successfully.
> >> [   63.244442] amdgpu: [dbg_xgmi_hive_get] ref_count 2
> >> [   63.244448] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   63.244454] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   63.244457] Workqueue: events work_for_cpu_fn
> >> [   63.244471] Call Trace:
> >> [   63.244474]  <TASK>
> >> [   63.244479]  dump_stack_lvl+0x4a/0x63
> >> [   63.244493]  dump_stack+0x10/0x16
> >> [   63.244501]  amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
> >> [   63.245047]  amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
> >> [   63.245463]  ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
> >> [   63.245879]  ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
> >> [   63.246466]  amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
> >> [   63.247055]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   63.247064]  ? do_pci_enable_device+0xdb/0x110
> >> [   63.247070]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   63.247463]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   63.247868]  local_pci_probe+0x4b/0x90
> >> [   63.247876]  work_for_cpu_fn+0x1a/0x30
> >> [   63.247881]  process_one_work+0x22b/0x3d0
> >> [   63.247887]  worker_thread+0x21d/0x3f0
> >> [   63.247893]  ? process_one_work+0x3d0/0x3d0
> >> [   63.247898]  kthread+0x12a/0x150
> >> [   63.247905]  ? set_kthread_struct+0x50/0x50
> >> [   63.247910]  ret_from_fork+0x22/0x30
> >> [   63.247922]  </TASK>
> >> [   63.248563] amdgpu 0000:83:00.0: amdgpu: XGMI: Add node 0, hive
> >> 0x25bbae7e3fd04cf4.
> >> [   63.248569] amdgpu: [dbg_xgmi_hive_get] ref_count 3
> >> [   63.248572] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   63.248578] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   63.248580] Workqueue: events work_for_cpu_fn
> >> [   63.248587] Call Trace:
> >> [   63.248588]  <TASK>
> >> [   63.248590]  dump_stack_lvl+0x4a/0x63
> >> [   63.248598]  dump_stack+0x10/0x16
> >> [   63.248604]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> >> [   63.249033]  amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
> >> [   63.249621]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   63.249627]  ? do_pci_enable_device+0xdb/0x110
> >> [   63.249632]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   63.250022]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   63.250410]  local_pci_probe+0x4b/0x90
> >> [   63.250416]  work_for_cpu_fn+0x1a/0x30
> >> [   63.250421]  process_one_work+0x22b/0x3d0
> >> [   63.250428]  worker_thread+0x21d/0x3f0
> >> [   63.250434]  ? process_one_work+0x3d0/0x3d0
> >> [   63.250440]  kthread+0x12a/0x150
> >> [   63.250445]  ? set_kthread_struct+0x50/0x50
> >> [   63.250450]  ret_from_fork+0x22/0x30
> >> [   63.250458]  </TASK>
> >> [   63.268869] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> >> [   63.269180] amdgpu: sdma_bitmap: ffff
> >> [   63.605188] memmap_init_zone_device initialised 8388608 pages in
> >> 132ms
> >> [   63.605203] amdgpu: HMM registered 32752MB device memory
> >> [   63.605244] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
> >>
> >> [   63.605263] amdgpu: Virtual CRAT table created for GPU
> >> [   63.605651] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
> >>
> >> [   63.605659] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
> >> [   63.605670] kfd kfd: amdgpu: added device 1002:66a1
> >> [   63.626300] amdgpu 0000:83:00.0: amdgpu: SE 4, SH per SE 1, CU per
> >> SH 16, active_cu_number 64
> >> [   63.626517] amdgpu 0000:83:00.0: amdgpu: ring gfx uses VM inv eng
> >> 0 on hub 0
> >> [   63.626522] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.0 uses VM
> >> inv eng 1 on hub 0
> >> [   63.626525] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.0 uses VM
> >> inv eng 4 on hub 0
> >> [   63.626529] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.0 uses VM
> >> inv eng 5 on hub 0
> >> [   63.626531] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.0 uses VM
> >> inv eng 6 on hub 0
> >> [   63.626534] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.1 uses VM
> >> inv eng 7 on hub 0
> >> [   63.626537] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.1 uses VM
> >> inv eng 8 on hub 0
> >> [   63.626540] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.1 uses VM
> >> inv eng 9 on hub 0
> >> [   63.626543] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.1 uses VM
> >> inv eng 10 on hub 0
> >> [   63.626546] amdgpu 0000:83:00.0: amdgpu: ring kiq_2.1.0 uses VM
> >> inv eng 11 on hub 0
> >> [   63.626549] amdgpu 0000:83:00.0: amdgpu: ring sdma0 uses VM inv
> >> eng 0 on hub 1
> >> [   63.626552] amdgpu 0000:83:00.0: amdgpu: ring page0 uses VM inv
> >> eng 1 on hub 1
> >> [   63.626555] amdgpu 0000:83:00.0: amdgpu: ring sdma1 uses VM inv
> >> eng 4 on hub 1
> >> [   63.626558] amdgpu 0000:83:00.0: amdgpu: ring page1 uses VM inv
> >> eng 5 on hub 1
> >> [   63.626561] amdgpu 0000:83:00.0: amdgpu: ring uvd_0 uses VM inv
> >> eng 6 on hub 1
> >> [   63.626563] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.0 uses VM
> >> inv eng 7 on hub 1
> >> [   63.626566] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.1 uses VM
> >> inv eng 8 on hub 1
> >> [   63.626569] amdgpu 0000:83:00.0: amdgpu: ring uvd_1 uses VM inv
> >> eng 9 on hub 1
> >> [   63.626572] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.0 uses VM
> >> inv eng 10 on hub 1
> >> [   63.626575] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.1 uses VM
> >> inv eng 11 on hub 1
> >> [   63.626577] amdgpu 0000:83:00.0: amdgpu: ring vce0 uses VM inv eng
> >> 12 on hub 1
> >> [   63.626580] amdgpu 0000:83:00.0: amdgpu: ring vce1 uses VM inv eng
> >> 13 on hub 1
> >> [   63.626583] amdgpu 0000:83:00.0: amdgpu: ring vce2 uses VM inv eng
> >> 14 on hub 1
> >> [   63.636996] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
> >> [   63.637046] amdgpu: Detected AMDGPU 2 Perf Events.
> >> [   63.637428] [drm] Initialized amdgpu 3.48.0 20150101 for
> >> 0000:83:00.0 on minor 1
> >> [   63.637937] amdgpu 0000:86:00.0: enabling device (0100 -> 0102)
> >> [   63.638043] [drm] initializing kernel modesetting (VEGA20
> >> 0x1002:0x66A1 0x1002:0x0834 0x00).
> >> [   63.638090] [drm] register mmio base: 0xFBB00000
> >> [   63.638092] [drm] register mmio size: 524288
> >> [   63.638261] [drm] add ip block number 0 <soc15_common>
> >> [   63.638263] [drm] add ip block number 1 <gmc_v9_0>
> >> [   63.638265] [drm] add ip block number 2 <vega20_ih>
> >> [   63.638266] [drm] add ip block number 3 <psp>
> >> [   63.638267] [drm] add ip block number 4 <powerplay>
> >> [   63.638269] [drm] add ip block number 5 <dm>
> >> [   63.638271] [drm] add ip block number 6 <gfx_v9_0>
> >> [   63.638272] [drm] add ip block number 7 <sdma_v4_0>
> >> [   63.638273] [drm] add ip block number 8 <uvd_v7_0>
> >> [   63.638275] [drm] add ip block number 9 <vce_v4_0>
> >> [   63.675838] amdgpu 0000:86:00.0: amdgpu: Fetched VBIOS from ROM BAR
> >> [   63.675842] amdgpu: ATOM BIOS: 113-D1630200-112
> >> [   63.675867] [drm] UVD(0) is enabled in VM mode
> >> [   63.675868] [drm] UVD(1) is enabled in VM mode
> >> [   63.675869] [drm] UVD(0) ENC is enabled in VM mode
> >> [   63.675870] [drm] UVD(1) ENC is enabled in VM mode
> >> [   63.675871] [drm] VCE enabled in VM mode
> >> [   63.675873] amdgpu 0000:86:00.0: amdgpu: Trusted Memory Zone (TMZ)
> >> feature not supported
> >> [   63.675899] [drm] GPU posting now...
> >> [   63.676276] amdgpu 0000:86:00.0: amdgpu: MEM ECC is active.
> >> [   63.676277] amdgpu 0000:86:00.0: amdgpu: SRAM ECC is active.
> >> [   63.676286] amdgpu 0000:86:00.0: amdgpu: RAS INFO: ras initialized
> >> successfully, hardware ability[7fff] ras_mask[7fff]
> >> [   63.676297] [drm] vm size is 262144 GB, 4 levels, block size is
> >> 9-bit, fragment size is 9-bit
> >> [   63.676304] amdgpu 0000:86:00.0: amdgpu: VRAM: 32752M
> >> 0x0000008800000000 - 0x0000008FFEFFFFFF (32752M used)
> >> [   63.676307] amdgpu 0000:86:00.0: amdgpu: GART: 512M
> >> 0x0000000000000000 - 0x000000001FFFFFFF
> >> [   63.676310] amdgpu 0000:86:00.0: amdgpu: AGP: 267845632M
> >> 0x0000009000000000 - 0x0000FFFFFFFFFFFF
> >> [   63.676321] [drm] Detected VRAM RAM=32752M, BAR=32768M
> >> [   63.676322] [drm] RAM width 4096bits HBM
> >> [   63.676363] [drm] amdgpu: 32752M of VRAM memory ready
> >> [   63.676365] [drm] amdgpu: 40203M of GTT memory ready.
> >> [   63.676388] [drm] GART: num cpu pages 131072, num gpu pages 131072
> >> [   63.676481] [drm] PCIE GART of 512M enabled.
> >> [   63.676482] [drm] PTB located at 0x0000008800000000
> >> [   63.676730] amdgpu 0000:86:00.0: amdgpu: PSP runtime database
> >> doesn't exist
> >> [   63.676733] amdgpu 0000:86:00.0: amdgpu: PSP runtime database
> >> doesn't exist
> >> [   63.677088] amdgpu: [powerplay] hwmgr_sw_init smu backed is
> >> vega20_smu
> >> [   63.678862] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
> >> [   63.678918] [drm] PSP loading UVD firmware
> >> [   63.679487] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
> >> [   63.679619] [drm] PSP loading VCE firmware
> >> [   63.831730] [drm] reserve 0x400000 from 0x8ffec00000 for PSP TMR
> >> [   63.914508] amdgpu 0000:86:00.0: amdgpu: HDCP: optional hdcp ta
> >> ucode is not available
> >> [   63.914513] amdgpu 0000:86:00.0: amdgpu: DTM: optional dtm ta
> >> ucode is not available
> >> [   63.914516] amdgpu 0000:86:00.0: amdgpu: RAP: optional rap ta
> >> ucode is not available
> >> [   63.914518] amdgpu 0000:86:00.0: amdgpu: SECUREDISPLAY:
> >> securedisplay ta ucode is not available
> >> [   63.917458] [drm] Display Core initialized with v3.2.196!
> >> [   63.919616] [drm] kiq ring mec 2 pipe 1 q 0
> >> [   63.961950] [drm] UVD and UVD ENC initialized successfully.
> >> [   64.160863] [drm] VCE initialized successfully.
> >> [   64.179285] amdgpu: [dbg_xgmi_hive_get] ref_count 4
> >> [   64.179291] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   64.179297] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   64.179299] Workqueue: events work_for_cpu_fn
> >> [   64.179311] Call Trace:
> >> [   64.179315]  <TASK>
> >> [   64.179320]  dump_stack_lvl+0x4a/0x63
> >> [   64.179331]  dump_stack+0x10/0x16
> >> [   64.179340]  amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
> >> [   64.179904]  amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
> >> [   64.180318]  ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
> >> [   64.180733]  ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
> >> [   64.181321]  amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
> >> [   64.181909]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   64.181917]  ? do_pci_enable_device+0xdb/0x110
> >> [   64.181923]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   64.182315]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   64.182703]  local_pci_probe+0x4b/0x90
> >> [   64.182710]  work_for_cpu_fn+0x1a/0x30
> >> [   64.182715]  process_one_work+0x22b/0x3d0
> >> [   64.182722]  worker_thread+0x21d/0x3f0
> >> [   64.182728]  ? process_one_work+0x3d0/0x3d0
> >> [   64.182734]  kthread+0x12a/0x150
> >> [   64.182740]  ? set_kthread_struct+0x50/0x50
> >> [   64.182745]  ret_from_fork+0x22/0x30
> >> [   64.182756]  </TASK>
> >> [   64.184561] amdgpu 0000:86:00.0: amdgpu: XGMI: Add node 1, hive
> >> 0x25bbae7e3fd04cf4.
> >> [   64.184568] amdgpu: [dbg_xgmi_hive_get] ref_count 5
> >> [   64.184571] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   64.184576] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   64.184578] Workqueue: events work_for_cpu_fn
> >> [   64.184585] Call Trace:
> >> [   64.184587]  <TASK>
> >> [   64.184589]  dump_stack_lvl+0x4a/0x63
> >> [   64.184596]  dump_stack+0x10/0x16
> >> [   64.184602]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> >> [   64.185041]  amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
> >> [   64.185624]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   64.185631]  ? do_pci_enable_device+0xdb/0x110
> >> [   64.185636]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   64.186027]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   64.186416]  local_pci_probe+0x4b/0x90
> >> [   64.186422]  work_for_cpu_fn+0x1a/0x30
> >> [   64.186428]  process_one_work+0x22b/0x3d0
> >> [   64.186434]  worker_thread+0x21d/0x3f0
> >> [   64.186439]  ? process_one_work+0x3d0/0x3d0
> >> [   64.186445]  kthread+0x12a/0x150
> >> [   64.186450]  ? set_kthread_struct+0x50/0x50
> >> [   64.186455]  ret_from_fork+0x22/0x30
> >> [   64.186464]  </TASK>
> >> [   64.206119] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
> >> [   64.206433] amdgpu: sdma_bitmap: ffff
> >> [   64.552064] memmap_init_zone_device initialised 8388608 pages in
> >> 132ms
> >> [   64.552080] amdgpu: HMM registered 32752MB device memory
> >> [   64.552116] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
> >>
> >> [   64.552138] amdgpu: Virtual CRAT table created for GPU
> >> [   64.552978] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!
> >>
> >> [   64.552988] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
> >> [   64.552999] kfd kfd: amdgpu: added device 1002:66a1
> >> [   64.570314] amdgpu 0000:86:00.0: amdgpu: SE 4, SH per SE 1, CU per
> >> SH 16, active_cu_number 64
> >> [   64.570527] amdgpu 0000:86:00.0: amdgpu: ring gfx uses VM inv eng
> >> 0 on hub 0
> >> [   64.570531] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.0 uses VM
> >> inv eng 1 on hub 0
> >> [   64.570535] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.0 uses VM
> >> inv eng 4 on hub 0
> >> [   64.570538] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.0 uses VM
> >> inv eng 5 on hub 0
> >> [   64.570541] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.0 uses VM
> >> inv eng 6 on hub 0
> >> [   64.570544] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.1 uses VM
> >> inv eng 7 on hub 0
> >> [   64.570547] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.1 uses VM
> >> inv eng 8 on hub 0
> >> [   64.570550] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.1 uses VM
> >> inv eng 9 on hub 0
> >> [   64.570552] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.1 uses VM
> >> inv eng 10 on hub 0
> >> [   64.570556] amdgpu 0000:86:00.0: amdgpu: ring kiq_2.1.0 uses VM
> >> inv eng 11 on hub 0
> >> [   64.570559] amdgpu 0000:86:00.0: amdgpu: ring sdma0 uses VM inv
> >> eng 0 on hub 1
> >> [   64.570562] amdgpu 0000:86:00.0: amdgpu: ring page0 uses VM inv
> >> eng 1 on hub 1
> >> [   64.570565] amdgpu 0000:86:00.0: amdgpu: ring sdma1 uses VM inv
> >> eng 4 on hub 1
> >> [   64.570567] amdgpu 0000:86:00.0: amdgpu: ring page1 uses VM inv
> >> eng 5 on hub 1
> >> [   64.570570] amdgpu 0000:86:00.0: amdgpu: ring uvd_0 uses VM inv
> >> eng 6 on hub 1
> >> [   64.570573] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.0 uses VM
> >> inv eng 7 on hub 1
> >> [   64.570576] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.1 uses VM
> >> inv eng 8 on hub 1
> >> [   64.570579] amdgpu 0000:86:00.0: amdgpu: ring uvd_1 uses VM inv
> >> eng 9 on hub 1
> >> [   64.570581] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.0 uses VM
> >> inv eng 10 on hub 1
> >> [   64.570584] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.1 uses VM
> >> inv eng 11 on hub 1
> >> [   64.570587] amdgpu 0000:86:00.0: amdgpu: ring vce0 uses VM inv eng
> >> 12 on hub 1
> >> [   64.570589] amdgpu 0000:86:00.0: amdgpu: ring vce1 uses VM inv eng
> >> 13 on hub 1
> >> [   64.570592] amdgpu 0000:86:00.0: amdgpu: ring vce2 uses VM inv eng
> >> 14 on hub 1
> >> [   64.581070] amdgpu: [dbg_xgmi_hive_get] ref_count 6
> >> [   64.581075] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   64.581079] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   64.581081] Workqueue: events work_for_cpu_fn
> >> [   64.581089] Call Trace:
> >> [   64.581091]  <TASK>
> >> [   64.581094]  dump_stack_lvl+0x4a/0x63
> >> [   64.581103]  dump_stack+0x10/0x16
> >> [   64.581109]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> >> [   64.581489]  amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
> >> [   64.581723]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> >> [   64.581943]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> >> [   64.582288]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   64.582295]  ? do_pci_enable_device+0xdb/0x110
> >> [   64.582298]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   64.582520]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   64.582738]  local_pci_probe+0x4b/0x90
> >> [   64.582743]  work_for_cpu_fn+0x1a/0x30
> >> [   64.582746]  process_one_work+0x22b/0x3d0
> >> [   64.582750]  worker_thread+0x21d/0x3f0
> >> [   64.582753]  ? process_one_work+0x3d0/0x3d0
> >> [   64.582756]  kthread+0x12a/0x150
> >> [   64.582761]  ? set_kthread_struct+0x50/0x50
> >> [   64.582764]  ret_from_fork+0x22/0x30
> >> [   64.582772]  </TASK>
> >> [   64.582774] amdgpu: [dbg_xgmi_hive_put] ref_count 5
> >> [   64.582775] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   64.582778] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   64.582779] Workqueue: events work_for_cpu_fn
> >> [   64.582782] Call Trace:
> >> [   64.582783]  <TASK>
> >> [   64.582784]  dump_stack_lvl+0x4a/0x63
> >> [   64.582789]  dump_stack+0x10/0x16
> >> [   64.582792]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> >> [   64.583028]  amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
> >> [   64.583262]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> >> [   64.583482]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> >> [   64.583833]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   64.583836]  ? do_pci_enable_device+0xdb/0x110
> >> [   64.583840]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   64.584072]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   64.584304]  local_pci_probe+0x4b/0x90
> >> [   64.584307]  work_for_cpu_fn+0x1a/0x30
> >> [   64.584311]  process_one_work+0x22b/0x3d0
> >> [   64.584314]  worker_thread+0x21d/0x3f0
> >> [   64.584318]  ? process_one_work+0x3d0/0x3d0
> >> [   64.584321]  kthread+0x12a/0x150
> >> [   64.584324]  ? set_kthread_struct+0x50/0x50
> >> [   64.584327]  ret_from_fork+0x22/0x30
> >> [   64.584333]  </TASK>
> >> [   64.584342] amdgpu: [dbg_xgmi_hive_get] ref_count 6
> >> [   64.584344] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   64.584347] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   64.584348] Workqueue: events work_for_cpu_fn
> >> [   64.584352] Call Trace:
> >> [   64.584353]  <TASK>
> >> [   64.584354]  dump_stack_lvl+0x4a/0x63
> >> [   64.584358]  dump_stack+0x10/0x16
> >> [   64.584362]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
> >> [   64.584610]  amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
> >> [   64.584856]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> >> [   64.585086]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> >> [   64.585437]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   64.585440]  ? do_pci_enable_device+0xdb/0x110
> >> [   64.585443]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   64.585679]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   64.585922]  local_pci_probe+0x4b/0x90
> >> [   64.585926]  work_for_cpu_fn+0x1a/0x30
> >> [   64.585929]  process_one_work+0x22b/0x3d0
> >> [   64.585932]  worker_thread+0x21d/0x3f0
> >> [   64.585936]  ? process_one_work+0x3d0/0x3d0
> >> [   64.585939]  kthread+0x12a/0x150
> >> [   64.585942]  ? set_kthread_struct+0x50/0x50
> >> [   64.585945]  ret_from_fork+0x22/0x30
> >> [   64.585950]  </TASK>
> >> [   64.585951] amdgpu: [dbg_xgmi_hive_put] ref_count 5
> >> [   64.585953] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted:
> >> G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [   64.585956] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [   64.585957] Workqueue: events work_for_cpu_fn
> >> [   64.585960] Call Trace:
> >> [   64.585961]  <TASK>
> >> [   64.585963]  dump_stack_lvl+0x4a/0x63
> >> [   64.585967]  dump_stack+0x10/0x16
> >> [   64.585970]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> >> [   64.586213]  amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
> >> [   64.586458]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
> >> [   64.586688]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
> >> [   64.587037]  ? pci_bus_read_config_word+0x4a/0x70
> >> [   64.587040]  ? do_pci_enable_device+0xdb/0x110
> >> [   64.587043]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
> >> [   64.587277]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
> >> [   64.587509]  local_pci_probe+0x4b/0x90
> >> [   64.587512]  work_for_cpu_fn+0x1a/0x30
> >> [   64.587515]  process_one_work+0x22b/0x3d0
> >> [   64.587519]  worker_thread+0x21d/0x3f0
> >> [   64.587523]  ? process_one_work+0x3d0/0x3d0
> >> [   64.587526]  kthread+0x12a/0x150
> >> [   64.587529]  ? set_kthread_struct+0x50/0x50
> >> [   64.587532]  ret_from_fork+0x22/0x30
> >> [   64.587537]  </TASK>
> >> [   64.587619] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
> >> [   64.587663] amdgpu: Detected AMDGPU 2 Perf Events.
> >> [   64.588081] [drm] Initialized amdgpu 3.48.0 20150101 for
> >> 0000:86:00.0 on minor 2
> >>
> >> Then driver unload (reference stuck at 2):
> >> [  110.117018] amdgpu 0000:86:00.0: amdgpu: amdgpu: finishing device.
> >> [  110.131638] [drm] free PSP TMR buffer
> >> [  110.420529] amdgpu: [dbg_xgmi_hive_put] ref_count 4
> >> [  110.420537] CPU: 27 PID: 1748 Comm: modprobe Tainted: G
> >> OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [  110.420545] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [  110.420548] Call Trace:
> >> [  110.420551]  <TASK>
> >> [  110.420556]  dump_stack_lvl+0x4a/0x63
> >> [  110.420569]  dump_stack+0x10/0x16
> >> [  110.420578]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> >> [  110.421001]  amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
> >> [  110.421380]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
> >> [  110.421724]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> >> [  110.422070]  drm_dev_release+0x28/0x50 [drm]
> >> [  110.422145]  devm_drm_dev_init_release+0x38/0x60 [drm]
> >> [  110.422190]  devm_action_release+0x15/0x20
> >> [  110.422198]  release_nodes+0x40/0xb0
> >> [  110.422205]  devres_release_all+0x9e/0xe0
> >> [  110.422212]  device_release_driver_internal+0x117/0x1f0
> >> [  110.422218]  driver_detach+0x4c/0xa0
> >> [  110.422222]  bus_remove_driver+0x6c/0xf0
> >> [  110.422227]  driver_unregister+0x31/0x50
> >> [  110.422231]  pci_unregister_driver+0x40/0x90
> >> [  110.422238]  amdgpu_exit+0x15/0x446 [amdgpu]
> >> [  110.422791]  __x64_sys_delete_module+0x14e/0x260
> >> [  110.422801]  ? do_syscall_64+0x69/0xc0
> >> [  110.422809]  ? __x64_sys_read+0x1a/0x20
> >> [  110.422817]  ? do_syscall_64+0x69/0xc0
> >> [  110.422821]  ? ksys_read+0x67/0xf0
> >> [  110.422825]  do_syscall_64+0x5c/0xc0
> >> [  110.422830]  ? __x64_sys_read+0x1a/0x20
> >> [  110.422834]  ? do_syscall_64+0x69/0xc0
> >> [  110.422839]  ? syscall_exit_to_user_mode+0x27/0x50
> >> [  110.422846]  ? __x64_sys_openat+0x20/0x30
> >> [  110.422853]  ? do_syscall_64+0x69/0xc0
> >> [  110.422857]  ? do_syscall_64+0x69/0xc0
> >> [  110.422862]  ? irqentry_exit+0x1d/0x30
> >> [  110.422868]  ? exc_page_fault+0x89/0x170
> >> [  110.422874]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >> [  110.422885] RIP: 0033:0x7f1576682a6b
> >> [  110.422892] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48
> >> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00
> >> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64
> >> 89 01 48
> >> [  110.422897] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX:
> >> 00000000000000b0
> >> [  110.422904] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX:
> >> 00007f1576682a6b
> >> [  110.422908] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
> >> 000056347ba575b8
> >> [  110.422911] RBP: 000056347ba57550 R08: 0000000000000000 R09:
> >> 0000000000000000
> >> [  110.422913] R10: 00007f15766feac0 R11: 0000000000000206 R12:
> >> 000056347ba575b8
> >> [  110.422916] R13: 0000000000000000 R14: 000056347ba575b8 R15:
> >> 000056347ba57550
> >> [  110.422921]  </TASK>
> >> [  110.425941] [drm] amdgpu: ttm finalized
> >> [  110.489186] amdgpu 0000:83:00.0: amdgpu: amdgpu: finishing device.
> >> [  110.504025] [drm] free PSP TMR buffer
> >> [  110.762272] amdgpu: [dbg_xgmi_hive_put] ref_count 3
> >> [  110.762280] CPU: 27 PID: 1748 Comm: modprobe Tainted: G
> >> OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [  110.762288] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [  110.762290] Call Trace:
> >> [  110.762294]  <TASK>
> >> [  110.762298]  dump_stack_lvl+0x4a/0x63
> >> [  110.762313]  dump_stack+0x10/0x16
> >> [  110.762319]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> >> [  110.762663]  amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
> >> [  110.762965]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
> >> [  110.763231]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> >> [  110.763519]  drm_dev_release+0x28/0x50 [drm]
> >> [  110.763569]  devm_drm_dev_init_release+0x38/0x60 [drm]
> >> [  110.763609]  devm_action_release+0x15/0x20
> >> [  110.763617]  release_nodes+0x40/0xb0
> >> [  110.763624]  devres_release_all+0x9e/0xe0
> >> [  110.763631]  device_release_driver_internal+0x117/0x1f0
> >> [  110.763636]  driver_detach+0x4c/0xa0
> >> [  110.763640]  bus_remove_driver+0x6c/0xf0
> >> [  110.763646]  driver_unregister+0x31/0x50
> >> [  110.763650]  pci_unregister_driver+0x40/0x90
> >> [  110.763657]  amdgpu_exit+0x15/0x446 [amdgpu]
> >> [  110.764153]  __x64_sys_delete_module+0x14e/0x260
> >> [  110.764164]  ? do_syscall_64+0x69/0xc0
> >> [  110.764172]  ? __x64_sys_read+0x1a/0x20
> >> [  110.764180]  ? do_syscall_64+0x69/0xc0
> >> [  110.764184]  ? ksys_read+0x67/0xf0
> >> [  110.764189]  do_syscall_64+0x5c/0xc0
> >> [  110.764193]  ? __x64_sys_read+0x1a/0x20
> >> [  110.764197]  ? do_syscall_64+0x69/0xc0
> >> [  110.764202]  ? syscall_exit_to_user_mode+0x27/0x50
> >> [  110.764209]  ? __x64_sys_openat+0x20/0x30
> >> [  110.764217]  ? do_syscall_64+0x69/0xc0
> >> [  110.764221]  ? do_syscall_64+0x69/0xc0
> >> [  110.764226]  ? irqentry_exit+0x1d/0x30
> >> [  110.764232]  ? exc_page_fault+0x89/0x170
> >> [  110.764238]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >> [  110.764248] RIP: 0033:0x7f1576682a6b
> >> [  110.764255] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48
> >> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00
> >> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64
> >> 89 01 48
> >> [  110.764260] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX:
> >> 00000000000000b0
> >> [  110.764267] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX:
> >> 00007f1576682a6b
> >> [  110.764270] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
> >> 000056347ba575b8
> >> [  110.764273] RBP: 000056347ba57550 R08: 0000000000000000 R09:
> >> 0000000000000000
> >> [  110.764275] R10: 00007f15766feac0 R11: 0000000000000206 R12:
> >> 000056347ba575b8
> >> [  110.764278] R13: 0000000000000000 R14: 000056347ba575b8 R15:
> >> 000056347ba57550
> >> [  110.764283]  </TASK>
> >> [  110.764326] amdgpu: [dbg_xgmi_hive_put] ref_count 2
> >> [  110.764329] CPU: 27 PID: 1748 Comm: modprobe Tainted: G
> >> OE     5.15.0-46-generic #49~20.04.1-Ubuntu
> >> [  110.764334] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1
> >> 09/14/2018
> >> [  110.764336] Call Trace:
> >> [  110.764337]  <TASK>
> >> [  110.764339]  dump_stack_lvl+0x4a/0x63
> >> [  110.764347]  dump_stack+0x10/0x16
> >> [  110.764354]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
> >> [  110.764624]  amdgpu_xgmi_remove_device+0x1ad/0x1c0 [amdgpu]
> >> [  110.764791]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
> >> [  110.764937]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
> >> [  110.765085]  drm_dev_release+0x28/0x50 [drm]
> >> [  110.765108]  devm_drm_dev_init_release+0x38/0x60 [drm]
> >> [  110.765130]  devm_action_release+0x15/0x20
> >> [  110.765134]  release_nodes+0x40/0xb0
> >> [  110.765137]  devres_release_all+0x9e/0xe0
> >> [  110.765141]  device_release_driver_internal+0x117/0x1f0
> >> [  110.765144]  driver_detach+0x4c/0xa0
> >> [  110.765146]  bus_remove_driver+0x6c/0xf0
> >> [  110.765148]  driver_unregister+0x31/0x50
> >> [  110.765150]  pci_unregister_driver+0x40/0x90
> >> [  110.765154]  amdgpu_exit+0x15/0x446 [amdgpu]
> >> [  110.765434]  __x64_sys_delete_module+0x14e/0x260
> >> [  110.765438]  ? do_syscall_64+0x69/0xc0
> >> [  110.765441]  ? __x64_sys_read+0x1a/0x20
> >> [  110.765444]  ? do_syscall_64+0x69/0xc0
> >> [  110.765446]  ? ksys_read+0x67/0xf0
> >> [  110.765449]  do_syscall_64+0x5c/0xc0
> >> [  110.765451]  ? __x64_sys_read+0x1a/0x20
> >> [  110.765454]  ? do_syscall_64+0x69/0xc0
> >> [  110.765456]  ? syscall_exit_to_user_mode+0x27/0x50
> >> [  110.765460]  ? __x64_sys_openat+0x20/0x30
> >> [  110.765464]  ? do_syscall_64+0x69/0xc0
> >> [  110.765466]  ? do_syscall_64+0x69/0xc0
> >> [  110.765469]  ? irqentry_exit+0x1d/0x30
> >> [  110.765472]  ? exc_page_fault+0x89/0x170
> >> [  110.765476]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >> [  110.765480] RIP: 0033:0x7f1576682a6b
> >> [  110.765482] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48
> >> 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00
> >> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64
> >> 89 01 48
> >> [  110.765485] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX:
> >> 00000000000000b0
> >> [  110.765488] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX:
> >> 00007f1576682a6b
> >> [  110.765489] RDX: 0000000000000000 RSI: 0000000000000800 RDI:
> >> 000056347ba575b8
> >> [  110.765491] RBP: 000056347ba57550 R08: 0000000000000000 R09:
> >> 0000000000000000
> >> [  110.765492] R10: 00007f15766feac0 R11: 0000000000000206 R12:
> >> 000056347ba575b8
> >> [  110.765494] R13: 0000000000000000 R14: 000056347ba575b8 R15:
> >> 000056347ba57550
> >> [  110.765496]  </TASK>
> >> [  110.768091] [drm] amdgpu: ttm finalized
> >>
> >>> -----Original Message-----
> >>> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> >>> Sent: August 11, 2022 12:43 PM
> >>> To: Kim, Jonathan <Jonathan.Kim at amd.com>; Kuehling, Felix
> >>> <Felix.Kuehling at amd.com>; amd-gfx at lists.freedesktop.org
> >>> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info
> >>> reference
> >>> leak
> >>>
> >>>
> >>> On 2022-08-11 11:34, Kim, Jonathan wrote:
> >>>> [Public]
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Kuehling, Felix <Felix.Kuehling at amd.com>
> >>>>> Sent: August 11, 2022 11:19 AM
> >>>>> To: amd-gfx at lists.freedesktop.org; Kim, Jonathan
> >>> <Jonathan.Kim at amd.com>
> >>>>> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info
> >>>>> reference
> >>>>> leak
> >>>>>
> >>>>> Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
> >>>>>> When an xgmi node is added to the hive, it takes another hive
> >>>>>> reference for its reset domain.
> >>>>>>
> >>>>>> This extra reference was not dropped on device removal from the
> >>>>>> hive so drop it.
> >>>>>>
> >>>>>> Signed-off-by: Jonathan Kim <jonathan.kim at amd.com>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
> >>>>>>     1 file changed, 3 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>>>>> index 1b108d03e785..560bf1c98f08 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>>>>> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
> >>>>>>       mutex_unlock(&hive->hive_lock);
> >>>>>>
> >>>>>>       amdgpu_put_xgmi_hive(hive);
> >>>>>> +   /* device is removed from the hive so remove its reset domain reference */
> >>>>>> +   if (adev->reset_domain && adev->reset_domain == hive->reset_domain)
> >>>>>> +           amdgpu_put_xgmi_hive(hive);
> >>>>> This is some messed up reference counting. If you need an extra
> >>>>> reference from the reset_domain to the hive, that should be owned
> >>>>> by the
> >>>>> reset_domain and dropped when the reset_domain is destroyed. And it's
> >>>>> only one reference for the reset_domain, not one reference per
> >>>>> adev in
> >>>>> the reset_domain.
> >>>> Cc'ing Andrey.
> >>>>
> >>>> What you're saying seems to make more sense to me, but what I got from an
> >>>> offline conversation with Andrey was that the reset domain reference per
> >>>> device was intentional.
> >>>> Maybe Andrey can comment here.
> >>>>
> >>>>> What you're doing here looks like every adev that's in a
> >>>>> reset_domain of
> >>>>> its hive has two references to the hive. And if you're dropping the
> >>>>> extra reference here, it still leaves the reset_domain with a
> >>>>> dangling
> >>>>> pointer to a hive that may no longer exist. So this extra
> >>>>> reference is
> >>>>> kind of pointless.
> >>>
> >>> reset_domain doesn't have any references to the hive, the hive has a
> >>> reference to reset_domain
> >>>
> >>>
> >>>> Yes.  Currently one reference is fetched from the device's lifetime on the
> >>>> hive and the other is from the per-device reset domain.
> >>>>
> >>>> Snippet from amdgpu_device_ip_init:
> >>>>           /**
> >>>>            * In case of XGMI grab extra reference for reset domain
> >>>> for this device
> >>>>            */
> >>>>           if (adev->gmc.xgmi.num_physical_nodes > 1) {
> >>>>                   if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here
> >>>
> >>>
> >>> amdgpu_xgmi_add_device calls amdgpu_get_xgmi_hive, and only the first
> >>> time amdgpu_get_xgmi_hive is called, when the hive is actually allocated
> >>> and initialized, do we proceed to create the reset domain, either from
> >>> scratch (first creation of the hive) or by taking a reference from adev
> >>> (see [1]).
> >>>
> >>>
> >>>
> >>> [1] -
> >>> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c#L394
> >>>
> >>>>                           struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again
> >>>
> >>>
> >>> So here I don't see how an extra reference to reset_domain is taken if
> >>> amdgpu_get_xgmi_hive returns early because the hive has already been
> >>> created and exists in the global hive container.
> >>>
> >>> Jonathan, can you please show the exact flow by which the refcount leak
> >>> on reset_domain happens?
> >>>
> >>> Andrey
> >>>
> >>>
> >>>>                           if (!hive->reset_domain || !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
> >>>>                                   r = -ENOENT;
> >>>>                                   goto init_failed;
> >>>>                           }
> >>>>
> >>>>                           /* Drop the early temporary reset domain we created for device */
> >>>>                           amdgpu_reset_put_reset_domain(adev->reset_domain);
> >>>>                           adev->reset_domain = hive->reset_domain;
> >>>>                   }
> >>>>           }
> >>>>
> >>>> One of these never gets dropped so a leak happens.
> >>>> So either the extra reference has to be dropped on device removal from the
> >>>> hive or, from what you've mentioned, the reset_domain reference fetch should
> >>>> be fixed to grab at the hive/reset_domain level.
> >>>> Thanks,
> >>>>
> >>>> Jon
> >>>>
> >>>>> Regards,
> >>>>>      Felix
> >>>>>
> >>>>>
> >>>>>>       adev->hive = NULL;
> >>>>>>
> >>>>>>       if (atomic_dec_return(&hive->number_devices) == 0) {
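
The accounting described in the thread can be sketched with a toy model. This is only an illustration under stated assumptions, not amdgpu code: it assumes, as Jon describes, that each device takes one hive reference when it is added to the hive and a second one during per-device init, and that removal drops only one of them unless the proposed extra put is applied. The names toy_hive, toy_device_load, toy_device_unload and toy_leaked_refs are hypothetical, and the model ignores any reference the hive holds on its own behalf.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the hive's kobject refcount; NOT the amdgpu structure. */
struct toy_hive {
	int refcount;
};

static void toy_hive_get(struct toy_hive *hive) { hive->refcount++; }
static void toy_hive_put(struct toy_hive *hive) { hive->refcount--; }

/*
 * Assumed load path: one reference when the device joins the hive,
 * one more during per-device init (the "extra" reference in the thread).
 */
static void toy_device_load(struct toy_hive *hive)
{
	toy_hive_get(hive);
	toy_hive_get(hive);
}

/*
 * Assumed unload path: one put on removal from the hive; the second put
 * models the extra drop proposed in the patch and is optional here.
 */
static void toy_device_unload(struct toy_hive *hive, bool drop_extra)
{
	toy_hive_put(hive);
	if (drop_extra)
		toy_hive_put(hive);
}

/* Returns the references still held after every device has been unloaded. */
int toy_leaked_refs(int num_devices, bool drop_extra)
{
	struct toy_hive hive = { .refcount = 0 };
	int i;

	assert(num_devices >= 0);

	for (i = 0; i < num_devices; i++)
		toy_device_load(&hive);
	for (i = 0; i < num_devices; i++)
		toy_device_unload(&hive, drop_extra);

	return hive.refcount;
}
```

Under these assumptions, toy_leaked_refs(2, false) leaves 2 references and toy_leaked_refs(8, false) leaves 8, in line with the counts reported for the 2 GPU and 8 GPU systems, while dropping the extra reference brings the count back to 0. The model does not settle who should own the extra reference, which is the point Felix and Andrey are debating.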

