How to get useful information other than "the whole system locks up"?
Braiam
braiamp at gmail.com
Sat Apr 6 19:25:40 UTC 2019
Managed to extract more information from one of the lock-ups:
[53798.406792] [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx
00000000865ddb9b is still alive
[53798.406862] [drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx
00000000865ddb9b is still alive
Apparently the messages above are unrelated, since there's a big jump
in the timestamps.
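For what it's worth, the size of that jump can be checked mechanically; a quick sketch run against abbreviated copies of the two lines above:

```shell
# Sketch: parse the dmesg timestamps and print the gap between the two
# *ERROR* groups, to confirm they belong to separate incidents.
# (Log lines abbreviated from the excerpts above.)
gap_report=$(printf '%s\n' \
  '[53798.406792] [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx is still alive' \
  '[135442.607022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout' |
awk -F'[][]' '{ t = $2 + 0
                # $2 is the seconds-since-boot timestamp between the brackets
                if (NR > 1) printf "gap: %.0f s (~%.0f h)\n", t - prev, (t - prev) / 3600
                prev = t }')
echo "$gap_report"
```

That reports a gap of roughly 23 hours, so the "still alive" messages really do look like a separate, earlier event.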
[135442.607022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
timeout, signaled seq=834761, emitted seq=834763
[135442.607111] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process pid 0 thread pid 0
[135442.607117] amdgpu 0000:01:00.0: GPU reset begin!
[135442.617041] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, signaled seq=11762398, emitted seq=11762400
[135442.617133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process Xorg pid 737 thread Xorg:cs0 pid 768
[135442.617139] amdgpu 0000:01:00.0: GPU reset begin!
Why does the GPU try to reset twice?
[135443.319047] amdgpu: [powerplay]
failed to send message 5e ret is 0
[135443.806941] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135444.294821] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135444.782666] amdgpu: [powerplay]
failed to send message 148 ret is 0
[135445.270683] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135445.759529] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135445.839244] [drm] REG_WAIT timeout 10us * 3500 tries -
dce_mi_free_dmif line:636
Here, a trace starts. Should I report a bug even if I'm unable to
reproduce the conditions that led to it?
[135445.839340] WARNING: CPU: 1 PID: 24740 at
drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:277
generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839341] Modules linked in: uinput nft_chain_route_ipv4
xt_CHECKSUM nft_chain_nat ipt_MASQUERADE nf_nat xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
nft_counter xt_tcpudp nft_compat tun bridge stp llc nf_tables
nfnetlink cpufreq_userspace cpufreq_powersave cpufreq_conservative
edac_mce_amd kvm_amd ccp rng_core kvm irqbypass crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel binfmt_misc nls_ascii nls_cp437 vfat
fat efi_pstore aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap
crypto_simd rfkill cryptd snd_hda_codec_realtek glue_helper
snd_hda_codec_generic pcspkr wmi_bmof joydev efivars k10temp
ledtrig_audio snd_hda_codec_hdmi sp5100_tco amdgpu snd_hda_intel chash
snd_hda_codec gpu_sched snd_hda_core ttm snd_hwdep snd_pcm sg
snd_timer drm_kms_helper snd soundcore drm evdev button pcc_cpufreq
acpi_cpufreq parport_pc ppdev lp parport efivarfs ip_tables x_tables
autofs4 btrfs xor zstd_decompress zstd_compress raid6_pq libcrc32c
crc32c_generic sd_mod
[135445.839368] hid_generic usbhid hid ahci libahci igb xhci_pci
crc32c_intel i2c_algo_bit dca libata xhci_hcd i2c_piix4 scsi_mod
usbcore wmi video gpio_amdpt gpio_generic
[135445.839377] CPU: 1 PID: 24740 Comm: kworker/1:0 Not tainted 5.1.0-rc3 #1
[135445.839378] Hardware name: System manufacturer System Product
Name/ROG STRIX B350-F GAMING, BIOS 4207 12/07/2018
[135445.839383] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135445.839451] RIP: 0010:generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839453] Code: 44 8b 44 24 68 48 8b 4c 24 60 48 c7 c7 a8 a2 8b
c0 8b 54 24 58 8b 74 24 04 e8 3a 88 8c e4 41 83 7c 24 20 01 0f 84 51
8d fe ff <0f> 0b e9 4a 8d fe ff e8 84 61 ed ff 48 c7 c7 00 80 95 c0 e8
a8 0f
[135445.839454] RSP: 0018:ffffb8d7099777b0 EFLAGS: 00010297
[135445.839455] RAX: 0000000000000044 RBX: 0000000000000dad RCX:
0000000000000000
[135445.839456] RDX: 0000000000000000 RSI: ffff8db08ea56638 RDI:
ffff8db08ea56638
[135445.839457] RBP: 0000000000000328 R08: 00000000000003fd R09:
0000000000000004
[135445.839458] R10: 0000000000000000 R11: 0000000000000001 R12:
ffff8db085256600
[135445.839459] R13: ffff8db086c4f000 R14: 0000000000000002 R15:
0000000000000010
[135445.839460] FS: 0000000000000000(0000) GS:ffff8db08ea40000(0000)
knlGS:0000000000000000
[135445.839461] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[135445.839462] CR2: 00007fe3feab6000 CR3: 00000001c0df4000 CR4:
00000000003406e0
[135445.839463] Call Trace:
[135445.839539] dce_mi_free_dmif+0xff/0x1e0 [amdgpu]
[135445.839607] dce110_reset_hw_ctx_wrap+0x149/0x1e0 [amdgpu]
[135445.839674] dce110_apply_ctx_to_hw+0x4b/0x490 [amdgpu]
[135445.839728] ? amdgpu_pm_compute_clocks+0xb9/0x5d0 [amdgpu]
[135445.839800] ? dm_pp_apply_display_requirements+0x196/0x1a0 [amdgpu]
[135445.839864] dc_commit_state+0x2ae/0x550 [amdgpu]
[135445.839936] ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840006] amdgpu_dm_atomic_commit_tail+0x1fb/0x1930 [amdgpu]
[135445.840011] ? kmalloc_order+0x14/0x30
[135445.840085] ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840088] ? _cond_resched+0x15/0x30
[135445.840090] ? _cond_resched+0x15/0x30
[135445.840092] ? __ww_mutex_lock.isra.9+0x3a/0x710
[135445.840101] ? drm_atomic_helper_setup_commit+0x380/0x480 [drm_kms_helper]
[135445.840104] ? __wake_up_common_lock+0x89/0xc0
[135445.840105] ? _cond_resched+0x15/0x30
[135445.840107] ? wait_for_completion_timeout+0x3b/0x1a0
[135445.840181] ? amdgpu_dm_atomic_commit_tail+0x1930/0x1930 [amdgpu]
[135445.840254] ? dm_plane_helper_prepare_fb+0x257/0x360 [amdgpu]
[135445.840263] ? commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840271] commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840279] drm_atomic_helper_commit+0xb4/0x120 [drm_kms_helper]
[135445.840285] drm_atomic_helper_disable_all+0x14d/0x160 [drm_kms_helper]
[135445.840292] drm_atomic_helper_suspend+0xc4/0xf0 [drm_kms_helper]
[135445.840356] dm_suspend+0x1c/0x60 [amdgpu]
[135445.840402] amdgpu_device_ip_suspend_phase1+0x9a/0xd0 [amdgpu]
[135445.840448] amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[135445.840508] amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
[135445.840567] amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135445.840627] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135445.840631] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135445.840634] process_one_work+0x1a7/0x3b0
[135445.840636] worker_thread+0x30/0x390
[135445.840638] ? create_worker+0x1a0/0x1a0
[135445.840639] kthread+0x112/0x130
[135445.840640] ? __kthread_parkme+0x70/0x70
[135445.840642] ret_from_fork+0x22/0x40
[135445.840645] ---[ end trace 660b6e2adb0f7476 ]---
Later, powerplay seems to fail a bunch of times, with message IDs
ranging from 5e to 186:
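A quick sketch for tallying which message IDs fail most often (the excerpt here is shortened; in practice the input would be the `dmesg` output itself):

```shell
# Sketch: count how many times each powerplay message ID fails.
# Shortened excerpt; real input would be piped from `dmesg`.
excerpt='amdgpu: [powerplay] failed to send message 5e ret is 0
amdgpu: [powerplay] failed to send message 145 ret is 0
amdgpu: [powerplay] failed to send message 146 ret is 0
amdgpu: [powerplay] failed to send message 145 ret is 0'
tally=$(printf '%s\n' "$excerpt" |
  grep -oE 'failed to send message [0-9a-f]+' |   # keep only the message part
  awk '{ count[$NF]++ } END { for (id in count) print count[id], id }' |
  sort -rn)                                       # most frequent ID first
echo "$tally"
```

On the excerpt above this prints `2 145` first, which at least makes it easy to see whether one particular ID dominates in the full log.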
[135446.364392] amdgpu: [powerplay]
failed to send message 148 ret is 0
[135446.887086] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135447.409654] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135447.933340] amdgpu: [powerplay]
failed to send message 133 ret is 0
[135447.933342] amdgpu: [powerplay] VI should always have 2 performance levels
[135448.456153] amdgpu: [powerplay]
failed to send message 5e ret is 0
[135448.978944] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135449.501800] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135450.024313] amdgpu: [powerplay]
failed to send message 148 ret is 0
[135450.546863] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135451.069831] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135451.592819] amdgpu: [powerplay]
failed to send message 16a ret is 0
[135452.115338] amdgpu: [powerplay]
failed to send message 186 ret is 0
[135452.637920] amdgpu: [powerplay]
failed to send message 54 ret is 0
[135453.160464] amdgpu: [powerplay]
failed to send message 13d ret is 0
[135453.682876] amdgpu: [powerplay]
failed to send message 14f ret is 0
[135454.205768] amdgpu: [powerplay]
failed to send message 151 ret is 0
[135454.728890] amdgpu: [powerplay]
failed to send message 135 ret is 0
[135455.252057] amdgpu: [powerplay]
failed to send message 190 ret is 0
[135455.774989] amdgpu: [powerplay]
failed to send message 63 ret is 0
[135456.297489] amdgpu: [powerplay]
failed to send message 84 ret is 0
There seems to be a newline somewhere, since the powerplay messages
aren't on the same line as the timestamp. After that, the GPU tries to
recover, but fails miserably:
[135456.299369] amdgpu 0000:01:00.0: GPU pci config reset
[135456.306844] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[135456.306964] [drm] PCIE gen 3 link speeds already enabled
[135456.310683] [drm] PCIE GART of 1024M enabled (table at 0x000000F400300000).
[135456.310768] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[135456.590097] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring gfx test failed (-110)
[135456.590145] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR*
resume of IP block <gfx_v7_0> failed -110
[135456.590169] amdgpu 0000:01:00.0: GPU reset(2) failed
This line gets repeated about 10 times:
[135456.590183] [drm] Skip scheduling IBs!
And the kernel tries to reset the GPU again:
[135456.590362] amdgpu 0000:01:00.0: GPU reset end with ret = -110
[135466.655925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
timeout, signaled seq=834763, emitted seq=834763
[135466.656025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process pid 0 thread pid 0
[135466.656032] amdgpu 0000:01:00.0: GPU reset begin!
[135578.700184] INFO: task kworker/0:2:24837 blocked for more than 120 seconds.
[135578.700189] Tainted: G W 5.1.0-rc3 #1
[135578.700191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135578.700194] kworker/0:2 D 0 24837 2 0x80000000
[135578.700206] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135578.700208] Call Trace:
[135578.700217] ? __schedule+0x2d4/0x870
[135578.700220] ? __switch_to_asm+0x34/0x70
[135578.700223] ? __switch_to_asm+0x40/0x70
[135578.700225] schedule+0x28/0x70
[135578.700228] schedule_timeout+0x26d/0x390
[135578.700230] ? __schedule+0x2dc/0x870
[135578.700234] dma_fence_default_wait+0x238/0x2a0
[135578.700236] ? dma_fence_release+0x90/0x90
[135578.700239] dma_fence_wait_timeout+0xdd/0x100
[135578.700243] drm_sched_stop+0xf2/0x130 [gpu_sched]
[135578.700340] amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135578.700428] amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135578.700514] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135578.700571] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135578.700576] process_one_work+0x1a7/0x3b0
[135578.700578] worker_thread+0x30/0x390
[135578.700581] ? create_worker+0x1a0/0x1a0
[135578.700583] kthread+0x112/0x130
[135578.700585] ? __kthread_parkme+0x70/0x70
[135578.700587] ret_from_fork+0x22/0x40
[135578.700594] INFO: task kworker/0:1:29946 blocked for more than 120 seconds.
[135578.700596] Tainted: G W 5.1.0-rc3 #1
[135578.700598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135578.700599] kworker/0:1 D 0 29946 2 0x80000000
[135578.700605] Workqueue: events drm_sched_job_finish [gpu_sched]
[135578.700606] Call Trace:
[135578.700609] ? __schedule+0x2d4/0x870
[135578.700611] schedule+0x28/0x70
[135578.700613] schedule_timeout+0x26d/0x390
[135578.700616] wait_for_completion+0x11f/0x190
[135578.700619] ? wake_up_q+0x80/0x80
[135578.700622] __flush_work+0x130/0x1f0
[135578.700624] ? worker_attach_to_pool+0x90/0x90
[135578.700626] ? get_work_pool+0x40/0x40
[135578.700629] __cancel_work_timer+0x10a/0x190
[135578.700632] ? __switch_to+0x15b/0x460
[135578.700634] ? __switch_to_asm+0x34/0x70
[135578.700636] ? __switch_to_asm+0x40/0x70
[135578.700639] drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135578.700642] process_one_work+0x1a7/0x3b0
[135578.700644] worker_thread+0x30/0x390
[135578.700647] ? create_worker+0x1a0/0x1a0
[135578.700648] kthread+0x112/0x130
[135578.700650] ? __kthread_parkme+0x70/0x70
[135578.700652] ret_from_fork+0x22/0x40
[135699.405742] INFO: task kworker/0:2:24837 blocked for more than 241 seconds.
[135699.405748] Tainted: G W 5.1.0-rc3 #1
[135699.405750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.405752] kworker/0:2 D 0 24837 2 0x80000000
[135699.405764] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.405765] Call Trace:
[135699.405773] ? __schedule+0x2d4/0x870
[135699.405776] ? __switch_to_asm+0x34/0x70
[135699.405779] ? __switch_to_asm+0x40/0x70
[135699.405782] schedule+0x28/0x70
[135699.405785] schedule_timeout+0x26d/0x390
[135699.405787] ? __schedule+0x2dc/0x870
[135699.405791] dma_fence_default_wait+0x238/0x2a0
[135699.405794] ? dma_fence_release+0x90/0x90
[135699.405796] dma_fence_wait_timeout+0xdd/0x100
[135699.405800] drm_sched_stop+0xf2/0x130 [gpu_sched]
[135699.405900] amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135699.405991] amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135699.406081] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406086] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406089] process_one_work+0x1a7/0x3b0
[135699.406092] worker_thread+0x30/0x390
[135699.406095] ? create_worker+0x1a0/0x1a0
[135699.406096] kthread+0x112/0x130
[135699.406098] ? __kthread_parkme+0x70/0x70
[135699.406101] ret_from_fork+0x22/0x40
[135699.406108] INFO: task kworker/0:1:29946 blocked for more than 241 seconds.
[135699.406110] Tainted: G W 5.1.0-rc3 #1
[135699.406111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.406113] kworker/0:1 D 0 29946 2 0x80000000
[135699.406118] Workqueue: events drm_sched_job_finish [gpu_sched]
[135699.406119] Call Trace:
[135699.406122] ? __schedule+0x2d4/0x870
[135699.406125] schedule+0x28/0x70
[135699.406127] schedule_timeout+0x26d/0x390
[135699.406130] wait_for_completion+0x11f/0x190
[135699.406133] ? wake_up_q+0x80/0x80
[135699.406136] __flush_work+0x130/0x1f0
[135699.406138] ? worker_attach_to_pool+0x90/0x90
[135699.406141] ? get_work_pool+0x40/0x40
[135699.406143] __cancel_work_timer+0x10a/0x190
[135699.406145] ? __switch_to+0x15b/0x460
[135699.406148] ? __switch_to_asm+0x34/0x70
[135699.406150] ? __switch_to_asm+0x40/0x70
[135699.406153] drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135699.406156] process_one_work+0x1a7/0x3b0
[135699.406158] worker_thread+0x30/0x390
[135699.406161] ? create_worker+0x1a0/0x1a0
[135699.406162] kthread+0x112/0x130
[135699.406164] ? __kthread_parkme+0x70/0x70
[135699.406167] ret_from_fork+0x22/0x40
[135699.406170] INFO: task kworker/2:0:29950 blocked for more than 120 seconds.
[135699.406171] Tainted: G W 5.1.0-rc3 #1
[135699.406173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.406174] kworker/2:0 D 0 29950 2 0x80000000
[135699.406179] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.406180] Call Trace:
[135699.406183] ? __schedule+0x2d4/0x870
[135699.406185] schedule+0x28/0x70
[135699.406187] schedule_preempt_disabled+0xa/0x10
[135699.406190] __mutex_lock.isra.8+0x2d0/0x4a0
[135699.406277] amdgpu_device_lock_adev+0x15/0x37 [amdgpu]
[135699.406364] amdgpu_device_gpu_recover+0x69/0x743 [amdgpu]
[135699.406453] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406457] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406460] process_one_work+0x1a7/0x3b0
[135699.406463] worker_thread+0x30/0x390
[135699.406465] ? create_worker+0x1a0/0x1a0
[135699.406467] kthread+0x112/0x130
[135699.406468] ? __kthread_parkme+0x70/0x70
[135699.406471] ret_from_fork+0x22/0x40
At this point I decided that the system wasn't going to recover, so I
tried to soft-reboot it, but it just locked up and I had to hard-reboot.
I've found this bug [1] that seems related (the user was also watching
videos). I will try removing vaapi and check whether that improves anything.
[1]: https://bugs.freedesktop.org/show_bug.cgi?id=106547
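As a sketch of that test (assuming vainfo is available and mpv is the player; `--hwdec=no` forces software decoding in mpv, so VA-API can be ruled out without uninstalling anything):

```shell
# Sketch, assuming a VA-API setup: see what the VA driver reports,
# handling the case where vainfo isn't installed.
if command -v vainfo >/dev/null 2>&1; then
    vainfo 2>&1 | head -n 5     # driver name and first few supported profiles
else
    echo "vainfo not installed; skipping VA-API check"
fi
va_check_done=yes

# To rule VA-API out in mpv without removing packages, force software decode:
#   mpv --hwdec=no somefile.mkv
```

If the lock-ups stop with hardware decoding forced off, that would point at the VA-API path rather than the GPU reset logic.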
On Wed, Apr 3, 2019 at 6:38 PM Braiam <braiamp at gmail.com> wrote:
>
> On Wed, Apr 3, 2019 at 2:02 PM Alex Deucher <alexdeucher at gmail.com> wrote:
> >
> > On Wed, Apr 3, 2019 at 2:58 AM Braiam <braiamp at gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I have a Sapphire Technology Hawaii XT (R9 290X) using amdgpu driver
> > > with kernel 5.1.0-rc3.
> > > The issue happens with current 4.19.0 debian testing, 4.20-trunk,
> > > 5.0.0-trunk and rc2 and 3.
> > >
> > > It usually happens when I'm reproducing video, but I haven't figured
> > > out a way to reproduce it. It happened once without reproducing. I'm
> > > aware that the support is experimental, but the radeon driver doesn't
> > > seem capable of direct rendering on this card, dropping to llvmpipe.
> >
> > Radeon should work out of the box. Maybe something is messed up with
> > your install?
>
> Doubtful, since I used a 2200G without issues. The only change was using
> amdgpu due to games being particularly slow.
>
> >
> > >
> > > I had a ssh server installed in case I could log in while it crashes,
> > > and the only relevant
> > > line I found was:
> > >
> > > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
> > > seq=399919, emitted seq=399921
> > >
> > > But that turned up several bug reports which seem to have been fixed,
> > > and whose context and symptoms are too different from mine.
> > >
> >
> > You appear to be experiencing a GPU lockup. Unfortunately, there can
> > be many things that cause it, so it really helps to have a good
> > reproducer case. You might try a newer version of mesa or llvm. What
> > does your "reproducing video" work flow use? What apps, APIs are
> > involved?
>
> By "reproducing video" I mean watching videos from either Youtube/Netflix
> with Firefox or mpv for local files. But note that I've experienced at
> least one lock up
> without any video on screen (I don't remember if there was something paused
> on the background).
>
> Using mesa 18.3.4 and, according to glxinfo, DRM 3.30.0 and LLVM 7.0.1,
> with the vulkan, vdpau and va drivers at the same version. I gave up and
> removed any and every instance of the rocm package, since its module
> couldn't compile against the kernel.
>
> Another detail about the lock-up: the screen doesn't freeze, but no
> signal is emitted, so my monitor goes to sleep.
>
> Upgraded to mesa 19.0.1 and llvm 8.0.0, although I didn't see anything
> relevant in the changelogs that could affect my experience.
>
>
> >
> > Alex
> >
> > > I have tried forcing the amdgpu xorg driver with the same results
> > > (I was using radeon).
> > >
> > > --
> > > Braiam
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
> --
> Braiam
--
Braiam