Crash on device remove in drm_mode_config_cleanup
Andrey Grodzovsky
andrey.grodzovsky at amd.com
Mon Apr 26 20:24:22 UTC 2021
Daniel, Harry and Nick - on the latest drm-misc-next (5.12-rc3), which I
am using to test the device unplug patches, a user testing with an eGPU
box reported a crash on unplug. I debugged it a bit myself and I see that
drm_mode_config_cleanup is called twice - once explicitly from the
display shutdown code and once as a callback from drm_managed_release.
Obviously there is a problem here. What's the best way to fix this?
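
To make the double call concrete, here is a minimal userspace sketch of
the pattern described above. It is not kernel code: every model_* name
is hypothetical, and it assumes that the init path registers the cleanup
as a managed release action, which is what the drm_managed_release
callback suggests. It only counts how often the cleanup runs on state
that was already torn down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MAX_ACTIONS 8

struct model_dev;
typedef void (*release_fn)(struct model_dev *dev);

struct model_dev {
	release_fn actions[MAX_ACTIONS]; /* managed release actions */
	size_t num_actions;
	int mode_config_live;  /* 1 while the mode config state exists */
	int double_cleanups;   /* cleanups that ran on already-freed state */
};

/* Stand-in for the cleanup function that shows up in both traces. */
static void model_mode_config_cleanup(struct model_dev *dev)
{
	if (!dev->mode_config_live) {
		/* In a real driver this would touch freed state. */
		dev->double_cleanups++;
		return;
	}
	dev->mode_config_live = 0;
}

static void model_add_action(struct model_dev *dev, release_fn fn)
{
	assert(dev->num_actions < MAX_ACTIONS);
	dev->actions[dev->num_actions++] = fn;
}

/* Assumption: init registers the cleanup as a managed action. */
static void model_mode_config_init(struct model_dev *dev)
{
	dev->mode_config_live = 1;
	model_add_action(dev, model_mode_config_cleanup);
}

/* Run the registered actions in reverse order of registration. */
static void model_managed_release(struct model_dev *dev)
{
	while (dev->num_actions > 0)
		dev->actions[--dev->num_actions](dev);
}

/*
 * Simulate device removal. If explicit_cleanup is nonzero, the driver
 * shutdown path also calls the cleanup itself before the managed
 * release runs. Returns the number of double cleanups observed.
 */
int model_remove(int explicit_cleanup)
{
	struct model_dev dev = {0};

	model_mode_config_init(&dev);
	if (explicit_cleanup)
		model_mode_config_cleanup(&dev);
	model_managed_release(&dev);
	return dev.double_cleanups;
}
```

In this model, model_remove(1) reports one cleanup on already-released
state and model_remove(0) reports none. That points at one possible
direction, dropping the explicit call and leaving teardown to the managed
release, but whether that is right for amdgpu is exactly the open
question here.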
root@andrey-test:~# echo 1 > /sys/bus/pci/drivers/amdgpu/0000\:05\:00.0/remove
[ 37.068698 < 3.923109>] amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
[ 37.081385 < 0.012687>] CPU: 1 PID: 2397 Comm: bash Tainted: G B W OE 5.12.0-rc3-drm-misc-next+ #3
[ 37.081397 < 0.000012>] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1004 08/13/2020
[ 37.081402 < 0.000005>] Call Trace:
[ 37.081407 < 0.000005>] dump_stack+0xa5/0xe6
[ 37.081419 < 0.000012>] drm_mode_config_cleanup.cold+0x5/0x4f [drm]
[ 37.081555 < 0.000136>] ? drm_mode_config_reset+0x220/0x220 [drm]
[ 37.081689 < 0.000134>] ? kfree+0xf3/0x3c0
[ 37.081699 < 0.000010>] amdgpu_dm_fini+0x73/0x230 [amdgpu]
[ 37.082541 < 0.000842>] dm_hw_fini+0x1e/0x30 [amdgpu]
[ 37.083404 < 0.000863>] amdgpu_device_fini_hw+0x38f/0x660 [amdgpu]
[ 37.084030 < 0.000626>] amdgpu_pci_remove+0x40/0x60 [amdgpu]
[ 37.084524 < 0.000494>] pci_device_remove+0x82/0x120
[ 37.084531 < 0.000007>] device_release_driver_internal+0x17b/0x2a0
[ 37.084537 < 0.000006>] ? sysfs_file_ops+0xa0/0xa0
[ 37.084541 < 0.000004>] pci_stop_bus_device+0xd5/0x100
[ 37.084547 < 0.000006>] pci_stop_and_remove_bus_device_locked+0x16/0x30
[ 37.084552 < 0.000005>] remove_store+0xe7/0x100
[ 37.084557 < 0.000005>] ? subordinate_bus_number_show+0xc0/0xc0
[ 37.084563 < 0.000006>] ? __check_object_size+0x16b/0x480
[ 37.084572 < 0.000009>] ? sysfs_file_ops+0x76/0xa0
[ 37.084577 < 0.000005>] ? sysfs_kf_write+0x83/0xe0
[ 37.084582 < 0.000005>] kernfs_fop_write_iter+0x1ef/0x290
[ 37.084587 < 0.000005>] new_sync_write+0x253/0x370
[ 37.084591 < 0.000004>] ? new_sync_read+0x360/0x360
[ 37.084596 < 0.000005>] ? lockdep_hardirqs_on_prepare+0x210/0x210
[ 37.084603 < 0.000007>] ? __cond_resched+0x15/0x30
[ 37.084608 < 0.000005>] ? __inode_security_revalidate+0xa2/0xb0
[ 37.084614 < 0.000006>] ? __might_sleep+0x45/0xf0
[ 37.084620 < 0.000006>] vfs_write+0x3d7/0x4e0
[ 37.084624 < 0.000004>] ? ksys_write+0xe6/0x1a0
[ 37.084629 < 0.000005>] ksys_write+0xe6/0x1a0
[ 37.084633 < 0.000004>] ? __ia32_sys_read+0x60/0x60
[ 37.084638 < 0.000005>] ? lockdep_hardirqs_on_prepare+0xe/0x210
[ 37.084643 < 0.000005>] ? syscall_enter_from_user_mode+0x27/0x70
[ 37.084648 < 0.000005>] do_syscall_64+0x33/0x80
[ 37.084653 < 0.000005>] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 37.084658 < 0.000005>] RIP: 0033:0x7f576c3e01e7
[ 37.084663 < 0.000005>] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 37.084667 < 0.000004>] RSP: 002b:00007ffcf7b05948 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 37.084672 < 0.000005>] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f576c3e01e7
[ 37.084675 < 0.000003>] RDX: 0000000000000002 RSI: 00005568ffe63d80 RDI: 0000000000000001
[ 37.084678 < 0.000003>] RBP: 00005568ffe63d80 R08: 000000000000000a R09: 0000000000000001
[ 37.084681 < 0.000003>] R10: 00005568ff9f3017 R11: 0000000000000246 R12: 0000000000000002
[ 37.084684 < 0.000003>] R13: 00007f576c4bb6a0 R14: 00007f576c4bc4a0 R15: 00007f576c4bb8a0
[ 37.400338 < 0.315654>] amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[ 37.401171 < 0.000833>] [drm] free PSP TMR buffer
[ 37.443240 < 0.042069>] [drm] amdgpu: ttm finalized
[ 37.443246 < 0.000006>] x86/PAT: bash:2397 freeing invalid memtype [mem 0xd0000000-0xdfffffff]
[ 37.443945 < 0.000699>] CPU: 3 PID: 2397 Comm: bash Tainted: G B W OE 5.12.0-rc3-drm-misc-next+ #3
[ 37.443952 < 0.000007>] Hardware name: ASUS System Product Name/ROG STRIX B550-F GAMING (WI-FI), BIOS 1004 08/13/2020
[ 37.443956 < 0.000004>] Call Trace:
[ 37.443959 < 0.000003>] dump_stack+0xa5/0xe6
[ 37.443967 < 0.000008>] drm_mode_config_cleanup.cold+0x5/0x4f [drm]
[ 37.444048 < 0.000081>] ? drm_mode_config_reset+0x220/0x220 [drm]
[ 37.444129 < 0.000081>] ? drm_mode_config_cleanup+0x430/0x430 [drm]
[ 37.444208 < 0.000079>] drm_managed_release+0xf2/0x1c0 [drm]
[ 37.444287 < 0.000079>] drm_dev_release+0x4d/0x80 [drm]
[ 37.444363 < 0.000076>] release_nodes+0x373/0x3e0
[ 37.444371 < 0.000008>] ? devres_close_group+0x150/0x150
[ 37.444376 < 0.000005>] ? _raw_spin_lock_irqsave+0x6c/0xb0
[ 37.444382 < 0.000006>] ? devres_release_all+0x3f/0x90
[ 37.444388 < 0.000006>] device_release_driver_internal+0x18b/0x2a0
[ 37.444393 < 0.000005>] ? sysfs_file_ops+0xa0/0xa0
[ 37.444398 < 0.000005>] pci_stop_bus_device+0xd5/0x100
[ 37.444404 < 0.000006>] pci_stop_and_remove_bus_device_locked+0x16/0x30
[ 37.444409 < 0.000005>] remove_store+0xe7/0x100
[ 37.444414 < 0.000005>] ? subordinate_bus_number_show+0xc0/0xc0
[ 37.444419 < 0.000005>] ? __check_object_size+0x16b/0x480
[ 37.444424 < 0.000005>] ? sysfs_file_ops+0x76/0xa0
[ 37.444428 < 0.000004>] ? sysfs_kf_write+0x83/0xe0
[ 37.444432 < 0.000004>] kernfs_fop_write_iter+0x1ef/0x290
[ 37.444437 < 0.000005>] new_sync_write+0x253/0x370
[ 37.444442 < 0.000005>] ? new_sync_read+0x360/0x360
[ 37.444447 < 0.000005>] ? lockdep_hardirqs_on_prepare+0x210/0x210
[ 37.444453 < 0.000006>] ? __cond_resched+0x15/0x30
[ 37.444457 < 0.000004>] ? __inode_security_revalidate+0xa2/0xb0
[ 37.444463 < 0.000006>] ? __might_sleep+0x45/0xf0
[ 37.444469 < 0.000006>] vfs_write+0x3d7/0x4e0
[ 37.444474 < 0.000005>] ? ksys_write+0xe6/0x1a0
[ 37.444478 < 0.000004>] ksys_write+0xe6/0x1a0
[ 37.444482 < 0.000004>] ? __ia32_sys_read+0x60/0x60
[ 37.444487 < 0.000005>] ? lockdep_hardirqs_on_prepare+0xe/0x210
[ 37.444492 < 0.000005>] ? syscall_enter_from_user_mode+0x27/0x70
[ 37.444496 < 0.000004>] do_syscall_64+0x33/0x80
[ 37.444502 < 0.000006>] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 37.444507 < 0.000005>] RIP: 0033:0x7f576c3e01e7
[ 37.444511 < 0.000004>] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 37.444515 < 0.000004>] RSP: 002b:00007ffcf7b05948 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 37.444520 < 0.000005>] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f576c3e01e7
[ 37.444524 < 0.000004>] RDX: 0000000000000002 RSI: 00005568ffe63d80 RDI: 0000000000000001
[ 37.444527 < 0.000003>] RBP: 00005568ffe63d80 R08: 000000000000000a R09: 0000000000000001
[ 37.444529 < 0.000002>] R10: 00005568ff9f3017 R11: 0000000000000246 R12: 0000000000000002
[ 37.444532 < 0.000003>] R13: 00007f576c4bb6a0 R14: 00007f576c4bc4a0 R15: 00007f576c4bb8a0
[ 37.572043 < 0.127511>] AMD-Vi: Completion-Wait loop timed out
[ 37.572152 < 0.000109>] pci 0000:05:00.0: Removing from iommu group 13