Crash on device remove in drm_mode_config_cleanup

Andrey Grodzovsky andrey.grodzovsky at amd.com
Mon Apr 26 20:24:22 UTC 2021


Daniel, Harry and Nick - on the latest drm-misc-next (5.12-rc3), which I 
am using to test the device unplug patches, a user testing with an eGPU 
box reported a crash on unplug. I debugged it a bit and I see that 
drm_mode_config_cleanup is called twice: once explicitly from the display 
shutdown code (amdgpu_dm_fini) and once as a callback from 
drm_managed_release. Obviously there is a problem here. What is the best 
way to fix this?
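
To make the sequence concrete, here is a minimal userspace sketch of the 
pattern, not kernel code. It assumes the cleanup was registered as a 
managed release action at init time, which is what the second trace 
suggests. All names in it (managed_dev, managed_mode_config_init, 
driver_fini, simulate_remove) are hypothetical stand-ins, not DRM APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
struct mode_config {
    int cleanup_calls;  /* how many times cleanup ran */
    int initialized;    /* 1 while the modelled resources are live */
};

typedef void (*release_fn)(struct mode_config *);

struct managed_dev {
    struct mode_config cfg;
    release_fn actions[4];  /* release actions run at final release */
    size_t n_actions;
};

/* Stand-in for the cleanup routine; it just counts its invocations. */
static void mode_config_cleanup(struct mode_config *cfg)
{
    cfg->cleanup_calls++;
    cfg->initialized = 0;
}

/* Models a managed init: it registers cleanup as a release action. */
static int managed_mode_config_init(struct managed_dev *dev)
{
    if (dev->n_actions >= 4)
        return -1;
    dev->cfg.initialized = 1;
    dev->cfg.cleanup_calls = 0;
    dev->actions[dev->n_actions++] = mode_config_cleanup;
    return 0;
}

/* Models a driver shutdown path that may also clean up explicitly. */
static void driver_fini(struct managed_dev *dev, int explicit_cleanup)
{
    if (explicit_cleanup)
        mode_config_cleanup(&dev->cfg);
}

/* Models the final release: runs registered actions in reverse order. */
static void managed_release(struct managed_dev *dev)
{
    while (dev->n_actions > 0)
        dev->actions[--dev->n_actions](&dev->cfg);
}

/* Returns the number of cleanup calls for one full remove sequence. */
static int simulate_remove(int explicit_cleanup)
{
    struct managed_dev dev = {0};
    int ret = managed_mode_config_init(&dev);

    assert(ret == 0);
    (void)ret;
    driver_fini(&dev, explicit_cleanup);
    managed_release(&dev);
    return dev.cfg.cleanup_calls;
}
```

In this model, simulate_remove(1) runs the cleanup twice (explicit call 
plus release action) and simulate_remove(0) runs it once, which matches 
the two dump_stack outputs below coming from different callers.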

root@andrey-test:~# echo 1 > /sys/bus/pci/drivers/amdgpu/0000\:05\:00.0/remove
[   37.068698 <    3.923109>] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[   37.081385 <    0.012687>] CPU: 1 PID: 2397 Comm: bash Tainted: G B   W  OE     5.12.0-rc3-drm-misc-next+ #3
[   37.081397 <    0.000012>] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1004 08/13/2020
[   37.081402 <    0.000005>] Call Trace:
[   37.081407 <    0.000005>]  dump_stack+0xa5/0xe6
[   37.081419 <    0.000012>]  drm_mode_config_cleanup.cold+0x5/0x4f [drm]
[   37.081555 <    0.000136>]  ? drm_mode_config_reset+0x220/0x220 [drm]
[   37.081689 <    0.000134>]  ? kfree+0xf3/0x3c0
[   37.081699 <    0.000010>]  amdgpu_dm_fini+0x73/0x230 [amdgpu]
[   37.082541 <    0.000842>]  dm_hw_fini+0x1e/0x30 [amdgpu]
[   37.083404 <    0.000863>]  amdgpu_device_fini_hw+0x38f/0x660 [amdgpu]
[   37.084030 <    0.000626>]  amdgpu_pci_remove+0x40/0x60 [amdgpu]
[   37.084524 <    0.000494>]  pci_device_remove+0x82/0x120
[   37.084531 <    0.000007>]  device_release_driver_internal+0x17b/0x2a0
[   37.084537 <    0.000006>]  ? sysfs_file_ops+0xa0/0xa0
[   37.084541 <    0.000004>]  pci_stop_bus_device+0xd5/0x100
[   37.084547 <    0.000006>]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[   37.084552 <    0.000005>]  remove_store+0xe7/0x100
[   37.084557 <    0.000005>]  ? subordinate_bus_number_show+0xc0/0xc0
[   37.084563 <    0.000006>]  ? __check_object_size+0x16b/0x480
[   37.084572 <    0.000009>]  ? sysfs_file_ops+0x76/0xa0
[   37.084577 <    0.000005>]  ? sysfs_kf_write+0x83/0xe0
[   37.084582 <    0.000005>]  kernfs_fop_write_iter+0x1ef/0x290
[   37.084587 <    0.000005>]  new_sync_write+0x253/0x370
[   37.084591 <    0.000004>]  ? new_sync_read+0x360/0x360
[   37.084596 <    0.000005>]  ? lockdep_hardirqs_on_prepare+0x210/0x210
[   37.084603 <    0.000007>]  ? __cond_resched+0x15/0x30
[   37.084608 <    0.000005>]  ? __inode_security_revalidate+0xa2/0xb0
[   37.084614 <    0.000006>]  ? __might_sleep+0x45/0xf0
[   37.084620 <    0.000006>]  vfs_write+0x3d7/0x4e0
[   37.084624 <    0.000004>]  ? ksys_write+0xe6/0x1a0
[   37.084629 <    0.000005>]  ksys_write+0xe6/0x1a0
[   37.084633 <    0.000004>]  ? __ia32_sys_read+0x60/0x60
[   37.084638 <    0.000005>]  ? lockdep_hardirqs_on_prepare+0xe/0x210
[   37.084643 <    0.000005>]  ? syscall_enter_from_user_mode+0x27/0x70
[   37.084648 <    0.000005>]  do_syscall_64+0x33/0x80
[   37.084653 <    0.000005>]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   37.084658 <    0.000005>] RIP: 0033:0x7f576c3e01e7
[   37.084663 <    0.000005>] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   37.084667 <    0.000004>] RSP: 002b:00007ffcf7b05948 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   37.084672 <    0.000005>] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f576c3e01e7
[   37.084675 <    0.000003>] RDX: 0000000000000002 RSI: 00005568ffe63d80 RDI: 0000000000000001
[   37.084678 <    0.000003>] RBP: 00005568ffe63d80 R08: 000000000000000a R09: 0000000000000001
[   37.084681 <    0.000003>] R10: 00005568ff9f3017 R11: 0000000000000246 R12: 0000000000000002
[   37.084684 <    0.000003>] R13: 00007f576c4bb6a0 R14: 00007f576c4bc4a0 R15: 00007f576c4bb8a0
[   37.400338 <    0.315654>] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[   37.401171 <    0.000833>] [drm] free PSP TMR buffer
[   37.443240 <    0.042069>] [drm] amdgpu: ttm finalized
[   37.443246 <    0.000006>] x86/PAT: bash:2397 freeing invalid memtype [mem 0xd0000000-0xdfffffff]
[   37.443945 <    0.000699>] CPU: 3 PID: 2397 Comm: bash Tainted: G B   W  OE     5.12.0-rc3-drm-misc-next+ #3
[   37.443952 <    0.000007>] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1004 08/13/2020
[   37.443956 <    0.000004>] Call Trace:
[   37.443959 <    0.000003>]  dump_stack+0xa5/0xe6
[   37.443967 <    0.000008>]  drm_mode_config_cleanup.cold+0x5/0x4f [drm]
[   37.444048 <    0.000081>]  ? drm_mode_config_reset+0x220/0x220 [drm]
[   37.444129 <    0.000081>]  ? drm_mode_config_cleanup+0x430/0x430 [drm]
[   37.444208 <    0.000079>]  drm_managed_release+0xf2/0x1c0 [drm]
[   37.444287 <    0.000079>]  drm_dev_release+0x4d/0x80 [drm]
[   37.444363 <    0.000076>]  release_nodes+0x373/0x3e0
[   37.444371 <    0.000008>]  ? devres_close_group+0x150/0x150
[   37.444376 <    0.000005>]  ? _raw_spin_lock_irqsave+0x6c/0xb0
[   37.444382 <    0.000006>]  ? devres_release_all+0x3f/0x90
[   37.444388 <    0.000006>]  device_release_driver_internal+0x18b/0x2a0
[   37.444393 <    0.000005>]  ? sysfs_file_ops+0xa0/0xa0
[   37.444398 <    0.000005>]  pci_stop_bus_device+0xd5/0x100
[   37.444404 <    0.000006>]  pci_stop_and_remove_bus_device_locked+0x16/0x30
[   37.444409 <    0.000005>]  remove_store+0xe7/0x100
[   37.444414 <    0.000005>]  ? subordinate_bus_number_show+0xc0/0xc0
[   37.444419 <    0.000005>]  ? __check_object_size+0x16b/0x480
[   37.444424 <    0.000005>]  ? sysfs_file_ops+0x76/0xa0
[   37.444428 <    0.000004>]  ? sysfs_kf_write+0x83/0xe0
[   37.444432 <    0.000004>]  kernfs_fop_write_iter+0x1ef/0x290
[   37.444437 <    0.000005>]  new_sync_write+0x253/0x370
[   37.444442 <    0.000005>]  ? new_sync_read+0x360/0x360
[   37.444447 <    0.000005>]  ? lockdep_hardirqs_on_prepare+0x210/0x210
[   37.444453 <    0.000006>]  ? __cond_resched+0x15/0x30
[   37.444457 <    0.000004>]  ? __inode_security_revalidate+0xa2/0xb0
[   37.444463 <    0.000006>]  ? __might_sleep+0x45/0xf0
[   37.444469 <    0.000006>]  vfs_write+0x3d7/0x4e0
[   37.444474 <    0.000005>]  ? ksys_write+0xe6/0x1a0
[   37.444478 <    0.000004>]  ksys_write+0xe6/0x1a0
[   37.444482 <    0.000004>]  ? __ia32_sys_read+0x60/0x60
[   37.444487 <    0.000005>]  ? lockdep_hardirqs_on_prepare+0xe/0x210
[   37.444492 <    0.000005>]  ? syscall_enter_from_user_mode+0x27/0x70
[   37.444496 <    0.000004>]  do_syscall_64+0x33/0x80
[   37.444502 <    0.000006>]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   37.444507 <    0.000005>] RIP: 0033:0x7f576c3e01e7
[   37.444511 <    0.000004>] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   37.444515 <    0.000004>] RSP: 002b:00007ffcf7b05948 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   37.444520 <    0.000005>] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f576c3e01e7
[   37.444524 <    0.000004>] RDX: 0000000000000002 RSI: 00005568ffe63d80 RDI: 0000000000000001
[   37.444527 <    0.000003>] RBP: 00005568ffe63d80 R08: 000000000000000a R09: 0000000000000001
[   37.444529 <    0.000002>] R10: 00005568ff9f3017 R11: 0000000000000246 R12: 0000000000000002
[   37.444532 <    0.000003>] R13: 00007f576c4bb6a0 R14: 00007f576c4bc4a0 R15: 00007f576c4bb8a0
[   37.572043 <    0.127511>] AMD-Vi: Completion-Wait loop timed out
[   37.572152 <    0.000109>] pci 0000:05:00.0: Removing from iommu group 13

