How to get useful information other than "the whole system locks up"?
Braiam
braiamp at gmail.com
Sat Apr 6 19:25:40 UTC 2019
Managed to extract more information from one of the lock-ups:
[53798.406792] [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx
00000000865ddb9b is still alive
[53798.406862] [drm:amdgpu_ctx_mgr_fini [amdgpu]] *ERROR* ctx
00000000865ddb9b is still alive
Apparently the messages above are unrelated, since there's a big jump
in the timestamps.
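For what it's worth, the size of that jump can be checked mechanically; a quick sketch run against abbreviated copies of the two lines above:

```shell
# Sketch: parse the dmesg timestamps and print the gap between the two
# *ERROR* groups, to confirm they belong to separate incidents.
# (Log lines abbreviated from the excerpts above.)
gap_report=$(printf '%s\n' \
  '[53798.406792] [drm:amdgpu_ctx_mgr_entity_fini [amdgpu]] *ERROR* ctx is still alive' \
  '[135442.607022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout' |
awk -F'[][]' '{ t = $2 + 0
                # $2 is the seconds-since-boot timestamp between the brackets
                if (NR > 1) printf "gap: %.0f s (~%.0f h)\n", t - prev, (t - prev) / 3600
                prev = t }')
echo "$gap_report"
```

That reports a gap of roughly 23 hours, so the "still alive" messages really do look like a separate, earlier event.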
[135442.607022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
timeout, signaled seq=834761, emitted seq=834763
[135442.607111] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process pid 0 thread pid 0
[135442.607117] amdgpu 0000:01:00.0: GPU reset begin!
[135442.617041] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, signaled seq=11762398, emitted seq=11762400
[135442.617133] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process Xorg pid 737 thread Xorg:cs0 pid 768
[135442.617139] amdgpu 0000:01:00.0: GPU reset begin!
Why does the GPU try to reset twice?
[135443.319047] amdgpu: [powerplay]
failed to send message 5e ret is 0
[135443.806941] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135444.294821] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135444.782666] amdgpu: [powerplay]
failed to send message 148 ret is 0
[135445.270683] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135445.759529] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135445.839244] [drm] REG_WAIT timeout 10us * 3500 tries -
dce_mi_free_dmif line:636
Here, a trace starts. Should I report a bug even if I'm unable to
reproduce the conditions that led to it?
[135445.839340] WARNING: CPU: 1 PID: 24740 at
drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:277
generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839341] Modules linked in: uinput nft_chain_route_ipv4
xt_CHECKSUM nft_chain_nat ipt_MASQUERADE nf_nat xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
nft_counter xt_tcpudp nft_compat tun bridge stp llc nf_tables
nfnetlink cpufreq_userspace cpufreq_powersave cpufreq_conservative
edac_mce_amd kvm_amd ccp rng_core kvm irqbypass crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel binfmt_misc nls_ascii nls_cp437 vfat
fat efi_pstore aesni_intel eeepc_wmi asus_wmi aes_x86_64 sparse_keymap
crypto_simd rfkill cryptd snd_hda_codec_realtek glue_helper
snd_hda_codec_generic pcspkr wmi_bmof joydev efivars k10temp
ledtrig_audio snd_hda_codec_hdmi sp5100_tco amdgpu snd_hda_intel chash
snd_hda_codec gpu_sched snd_hda_core ttm snd_hwdep snd_pcm sg
snd_timer drm_kms_helper snd soundcore drm evdev button pcc_cpufreq
acpi_cpufreq parport_pc ppdev lp parport efivarfs ip_tables x_tables
autofs4 btrfs xor zstd_decompress zstd_compress raid6_pq libcrc32c
crc32c_generic sd_mod
[135445.839368] hid_generic usbhid hid ahci libahci igb xhci_pci
crc32c_intel i2c_algo_bit dca libata xhci_hcd i2c_piix4 scsi_mod
usbcore wmi video gpio_amdpt gpio_generic
[135445.839377] CPU: 1 PID: 24740 Comm: kworker/1:0 Not tainted 5.1.0-rc3 #1
[135445.839378] Hardware name: System manufacturer System Product
Name/ROG STRIX B350-F GAMING, BIOS 4207 12/07/2018
[135445.839383] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135445.839451] RIP: 0010:generic_reg_wait.cold.3+0x2a/0x31 [amdgpu]
[135445.839453] Code: 44 8b 44 24 68 48 8b 4c 24 60 48 c7 c7 a8 a2 8b
c0 8b 54 24 58 8b 74 24 04 e8 3a 88 8c e4 41 83 7c 24 20 01 0f 84 51
8d fe ff <0f> 0b e9 4a 8d fe ff e8 84 61 ed ff 48 c7 c7 00 80 95 c0 e8
a8 0f
[135445.839454] RSP: 0018:ffffb8d7099777b0 EFLAGS: 00010297
[135445.839455] RAX: 0000000000000044 RBX: 0000000000000dad RCX:
0000000000000000
[135445.839456] RDX: 0000000000000000 RSI: ffff8db08ea56638 RDI:
ffff8db08ea56638
[135445.839457] RBP: 0000000000000328 R08: 00000000000003fd R09:
0000000000000004
[135445.839458] R10: 0000000000000000 R11: 0000000000000001 R12:
ffff8db085256600
[135445.839459] R13: ffff8db086c4f000 R14: 0000000000000002 R15:
0000000000000010
[135445.839460] FS: 0000000000000000(0000) GS:ffff8db08ea40000(0000)
knlGS:0000000000000000
[135445.839461] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[135445.839462] CR2: 00007fe3feab6000 CR3: 00000001c0df4000 CR4:
00000000003406e0
[135445.839463] Call Trace:
[135445.839539] dce_mi_free_dmif+0xff/0x1e0 [amdgpu]
[135445.839607] dce110_reset_hw_ctx_wrap+0x149/0x1e0 [amdgpu]
[135445.839674] dce110_apply_ctx_to_hw+0x4b/0x490 [amdgpu]
[135445.839728] ? amdgpu_pm_compute_clocks+0xb9/0x5d0 [amdgpu]
[135445.839800] ? dm_pp_apply_display_requirements+0x196/0x1a0 [amdgpu]
[135445.839864] dc_commit_state+0x2ae/0x550 [amdgpu]
[135445.839936] ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840006] amdgpu_dm_atomic_commit_tail+0x1fb/0x1930 [amdgpu]
[135445.840011] ? kmalloc_order+0x14/0x30
[135445.840085] ? dm_determine_update_type_for_commit+0x5a/0x410 [amdgpu]
[135445.840088] ? _cond_resched+0x15/0x30
[135445.840090] ? _cond_resched+0x15/0x30
[135445.840092] ? __ww_mutex_lock.isra.9+0x3a/0x710
[135445.840101] ? drm_atomic_helper_setup_commit+0x380/0x480 [drm_kms_helper]
[135445.840104] ? __wake_up_common_lock+0x89/0xc0
[135445.840105] ? _cond_resched+0x15/0x30
[135445.840107] ? wait_for_completion_timeout+0x3b/0x1a0
[135445.840181] ? amdgpu_dm_atomic_commit_tail+0x1930/0x1930 [amdgpu]
[135445.840254] ? dm_plane_helper_prepare_fb+0x257/0x360 [amdgpu]
[135445.840263] ? commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840271] commit_tail+0x3d/0x70 [drm_kms_helper]
[135445.840279] drm_atomic_helper_commit+0xb4/0x120 [drm_kms_helper]
[135445.840285] drm_atomic_helper_disable_all+0x14d/0x160 [drm_kms_helper]
[135445.840292] drm_atomic_helper_suspend+0xc4/0xf0 [drm_kms_helper]
[135445.840356] dm_suspend+0x1c/0x60 [amdgpu]
[135445.840402] amdgpu_device_ip_suspend_phase1+0x9a/0xd0 [amdgpu]
[135445.840448] amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[135445.840508] amdgpu_device_pre_asic_reset+0x1ef/0x204 [amdgpu]
[135445.840567] amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135445.840627] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135445.840631] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135445.840634] process_one_work+0x1a7/0x3b0
[135445.840636] worker_thread+0x30/0x390
[135445.840638] ? create_worker+0x1a0/0x1a0
[135445.840639] kthread+0x112/0x130
[135445.840640] ? __kthread_parkme+0x70/0x70
[135445.840642] ret_from_fork+0x22/0x40
[135445.840645] ---[ end trace 660b6e2adb0f7476 ]---
Later, powerplay seems to fail a bunch of times, with message IDs
ranging from 5e to 186:
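A quick sketch for tallying which message IDs fail most often (the excerpt here is shortened; in practice the input would be the `dmesg` output itself):

```shell
# Sketch: count how many times each powerplay message ID fails.
# Shortened excerpt; real input would be piped from `dmesg`.
excerpt='amdgpu: [powerplay] failed to send message 5e ret is 0
amdgpu: [powerplay] failed to send message 145 ret is 0
amdgpu: [powerplay] failed to send message 146 ret is 0
amdgpu: [powerplay] failed to send message 145 ret is 0'
tally=$(printf '%s\n' "$excerpt" |
  grep -oE 'failed to send message [0-9a-f]+' |   # keep only the message part
  awk '{ count[$NF]++ } END { for (id in count) print count[id], id }' |
  sort -rn)                                       # most frequent ID first
echo "$tally"
```

On the excerpt above this prints `2 145` first, which at least makes it easy to see whether one particular ID dominates in the full log.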
[135446.364392] amdgpu: [powerplay]
failed to send message 148 ret is 0
[135446.887086] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135447.409654] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135447.933340] amdgpu: [powerplay]
failed to send message 133 ret is 0
[135447.933342] amdgpu: [powerplay] VI should always have 2 performance levels
[135448.456153] amdgpu: [powerplay]
failed to send message 5e ret is 0
[135448.978944] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135449.501800] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135450.024313] amdgpu: [powerplay]
failed to send message 148 ret is 0
[135450.546863] amdgpu: [powerplay]
failed to send message 145 ret is 0
[135451.069831] amdgpu: [powerplay]
failed to send message 146 ret is 0
[135451.592819] amdgpu: [powerplay]
failed to send message 16a ret is 0
[135452.115338] amdgpu: [powerplay]
failed to send message 186 ret is 0
[135452.637920] amdgpu: [powerplay]
failed to send message 54 ret is 0
[135453.160464] amdgpu: [powerplay]
failed to send message 13d ret is 0
[135453.682876] amdgpu: [powerplay]
failed to send message 14f ret is 0
[135454.205768] amdgpu: [powerplay]
failed to send message 151 ret is 0
[135454.728890] amdgpu: [powerplay]
failed to send message 135 ret is 0
[135455.252057] amdgpu: [powerplay]
failed to send message 190 ret is 0
[135455.774989] amdgpu: [powerplay]
failed to send message 63 ret is 0
[135456.297489] amdgpu: [powerplay]
failed to send message 84 ret is 0
There seems to be a newline somewhere, since the powerplay messages
aren't on the same line as the timestamp. After that, the GPU tries to
recover, but fails miserably:
[135456.299369] amdgpu 0000:01:00.0: GPU pci config reset
[135456.306844] amdgpu 0000:01:00.0: GPU reset succeeded, trying to resume
[135456.306964] [drm] PCIE gen 3 link speeds already enabled
[135456.310683] [drm] PCIE GART of 1024M enabled (table at 0x000000F400300000).
[135456.310768] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* VRAM is lost!
[135456.590097] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring gfx test failed (-110)
[135456.590145] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR*
resume of IP block <gfx_v7_0> failed -110
[135456.590169] amdgpu 0000:01:00.0: GPU reset(2) failed
This line gets repeated about 10 times:
[135456.590183] [drm] Skip scheduling IBs!
And the kernel tries to reset the GPU again:
[135456.590362] amdgpu 0000:01:00.0: GPU reset end with ret = -110
[135466.655925] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0
timeout, signaled seq=834763, emitted seq=834763
[135466.656025] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process pid 0 thread pid 0
[135466.656032] amdgpu 0000:01:00.0: GPU reset begin!
[135578.700184] INFO: task kworker/0:2:24837 blocked for more than 120 seconds.
[135578.700189] Tainted: G W 5.1.0-rc3 #1
[135578.700191] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135578.700194] kworker/0:2 D 0 24837 2 0x80000000
[135578.700206] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135578.700208] Call Trace:
[135578.700217] ? __schedule+0x2d4/0x870
[135578.700220] ? __switch_to_asm+0x34/0x70
[135578.700223] ? __switch_to_asm+0x40/0x70
[135578.700225] schedule+0x28/0x70
[135578.700228] schedule_timeout+0x26d/0x390
[135578.700230] ? __schedule+0x2dc/0x870
[135578.700234] dma_fence_default_wait+0x238/0x2a0
[135578.700236] ? dma_fence_release+0x90/0x90
[135578.700239] dma_fence_wait_timeout+0xdd/0x100
[135578.700243] drm_sched_stop+0xf2/0x130 [gpu_sched]
[135578.700340] amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135578.700428] amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135578.700514] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135578.700571] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135578.700576] process_one_work+0x1a7/0x3b0
[135578.700578] worker_thread+0x30/0x390
[135578.700581] ? create_worker+0x1a0/0x1a0
[135578.700583] kthread+0x112/0x130
[135578.700585] ? __kthread_parkme+0x70/0x70
[135578.700587] ret_from_fork+0x22/0x40
[135578.700594] INFO: task kworker/0:1:29946 blocked for more than 120 seconds.
[135578.700596] Tainted: G W 5.1.0-rc3 #1
[135578.700598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135578.700599] kworker/0:1 D 0 29946 2 0x80000000
[135578.700605] Workqueue: events drm_sched_job_finish [gpu_sched]
[135578.700606] Call Trace:
[135578.700609] ? __schedule+0x2d4/0x870
[135578.700611] schedule+0x28/0x70
[135578.700613] schedule_timeout+0x26d/0x390
[135578.700616] wait_for_completion+0x11f/0x190
[135578.700619] ? wake_up_q+0x80/0x80
[135578.700622] __flush_work+0x130/0x1f0
[135578.700624] ? worker_attach_to_pool+0x90/0x90
[135578.700626] ? get_work_pool+0x40/0x40
[135578.700629] __cancel_work_timer+0x10a/0x190
[135578.700632] ? __switch_to+0x15b/0x460
[135578.700634] ? __switch_to_asm+0x34/0x70
[135578.700636] ? __switch_to_asm+0x40/0x70
[135578.700639] drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135578.700642] process_one_work+0x1a7/0x3b0
[135578.700644] worker_thread+0x30/0x390
[135578.700647] ? create_worker+0x1a0/0x1a0
[135578.700648] kthread+0x112/0x130
[135578.700650] ? __kthread_parkme+0x70/0x70
[135578.700652] ret_from_fork+0x22/0x40
[135699.405742] INFO: task kworker/0:2:24837 blocked for more than 241 seconds.
[135699.405748] Tainted: G W 5.1.0-rc3 #1
[135699.405750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.405752] kworker/0:2 D 0 24837 2 0x80000000
[135699.405764] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.405765] Call Trace:
[135699.405773] ? __schedule+0x2d4/0x870
[135699.405776] ? __switch_to_asm+0x34/0x70
[135699.405779] ? __switch_to_asm+0x40/0x70
[135699.405782] schedule+0x28/0x70
[135699.405785] schedule_timeout+0x26d/0x390
[135699.405787] ? __schedule+0x2dc/0x870
[135699.405791] dma_fence_default_wait+0x238/0x2a0
[135699.405794] ? dma_fence_release+0x90/0x90
[135699.405796] dma_fence_wait_timeout+0xdd/0x100
[135699.405800] drm_sched_stop+0xf2/0x130 [gpu_sched]
[135699.405900] amdgpu_device_pre_asic_reset+0x3f/0x204 [amdgpu]
[135699.405991] amdgpu_device_gpu_recover+0x7b/0x743 [amdgpu]
[135699.406081] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406086] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406089] process_one_work+0x1a7/0x3b0
[135699.406092] worker_thread+0x30/0x390
[135699.406095] ? create_worker+0x1a0/0x1a0
[135699.406096] kthread+0x112/0x130
[135699.406098] ? __kthread_parkme+0x70/0x70
[135699.406101] ret_from_fork+0x22/0x40
[135699.406108] INFO: task kworker/0:1:29946 blocked for more than 241 seconds.
[135699.406110] Tainted: G W 5.1.0-rc3 #1
[135699.406111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.406113] kworker/0:1 D 0 29946 2 0x80000000
[135699.406118] Workqueue: events drm_sched_job_finish [gpu_sched]
[135699.406119] Call Trace:
[135699.406122] ? __schedule+0x2d4/0x870
[135699.406125] schedule+0x28/0x70
[135699.406127] schedule_timeout+0x26d/0x390
[135699.406130] wait_for_completion+0x11f/0x190
[135699.406133] ? wake_up_q+0x80/0x80
[135699.406136] __flush_work+0x130/0x1f0
[135699.406138] ? worker_attach_to_pool+0x90/0x90
[135699.406141] ? get_work_pool+0x40/0x40
[135699.406143] __cancel_work_timer+0x10a/0x190
[135699.406145] ? __switch_to+0x15b/0x460
[135699.406148] ? __switch_to_asm+0x34/0x70
[135699.406150] ? __switch_to_asm+0x40/0x70
[135699.406153] drm_sched_job_finish+0x25/0x60 [gpu_sched]
[135699.406156] process_one_work+0x1a7/0x3b0
[135699.406158] worker_thread+0x30/0x390
[135699.406161] ? create_worker+0x1a0/0x1a0
[135699.406162] kthread+0x112/0x130
[135699.406164] ? __kthread_parkme+0x70/0x70
[135699.406167] ret_from_fork+0x22/0x40
[135699.406170] INFO: task kworker/2:0:29950 blocked for more than 120 seconds.
[135699.406171] Tainted: G W 5.1.0-rc3 #1
[135699.406173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[135699.406174] kworker/2:0 D 0 29950 2 0x80000000
[135699.406179] Workqueue: events drm_sched_job_timedout [gpu_sched]
[135699.406180] Call Trace:
[135699.406183] ? __schedule+0x2d4/0x870
[135699.406185] schedule+0x28/0x70
[135699.406187] schedule_preempt_disabled+0xa/0x10
[135699.406190] __mutex_lock.isra.8+0x2d0/0x4a0
[135699.406277] amdgpu_device_lock_adev+0x15/0x37 [amdgpu]
[135699.406364] amdgpu_device_gpu_recover+0x69/0x743 [amdgpu]
[135699.406453] amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[135699.406457] drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[135699.406460] process_one_work+0x1a7/0x3b0
[135699.406463] worker_thread+0x30/0x390
[135699.406465] ? create_worker+0x1a0/0x1a0
[135699.406467] kthread+0x112/0x130
[135699.406468] ? __kthread_parkme+0x70/0x70
[135699.406471] ret_from_fork+0x22/0x40
At this point I decided that the system wasn't going to recover, so I
tried to soft-reboot it, but it just locked up and I had to hard-reboot.
I've found this bug [1] that seems related (the user was also watching
videos). I will try removing vaapi and check whether that improves anything.
[1]: https://bugs.freedesktop.org/show_bug.cgi?id=106547
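As a sketch of that test (assuming vainfo is available and mpv is the player; `--hwdec=no` forces software decoding in mpv, so VA-API can be ruled out without uninstalling anything):

```shell
# Sketch, assuming a VA-API setup: see what the VA driver reports,
# handling the case where vainfo isn't installed.
if command -v vainfo >/dev/null 2>&1; then
    vainfo 2>&1 | head -n 5     # driver name and first few supported profiles
else
    echo "vainfo not installed; skipping VA-API check"
fi
va_check_done=yes

# To rule VA-API out in mpv without removing packages, force software decode:
#   mpv --hwdec=no somefile.mkv
```

If the lock-ups stop with hardware decoding forced off, that would point at the VA-API path rather than the GPU reset logic.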
On Wed, Apr 3, 2019 at 6:38 PM Braiam <braiamp at gmail.com> wrote:
>
> On Wed, Apr 3, 2019 at 2:02 PM Alex Deucher <alexdeucher at gmail.com> wrote:
> >
> > On Wed, Apr 3, 2019 at 2:58 AM Braiam <braiamp at gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I have a Sapphire Technology Hawaii XT (R9 290X) using amdgpu driver
> > > with kernel 5.1.0-rc3.
> > > The issue happens with current 4.19.0 debian testing, 4.20-trunk,
> > > 5.0.0-trunk and rc2 and 3.
> > >
> > > It usually happens when I'm reproducing video, but I haven't figured
> > > out a way to reproduce it. It happened once without reproducing. I'm
> > > aware that the support is experimental, but the radeon driver doesn't
> > > seem capable of direct rendering on this card, dropping to llvmpipe.
> >
> > Radeon should work out of the box. Maybe something is messed up with
> > your install?
>
> Doubtful, since I used a 2200G without issues. The only change was using
> amdgpu due to games being particularly slow.
>
> >
> > >
> > > I had a ssh server installed in case I could log in while it crashes,
> > > and the only relevant
> > > line I found was:
> > >
> > > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled
> > > seq=399919, emitted seq=399921
> > >
> > > But that turned up several bug reports which seem to have been fixed,
> > > and whose context and symptoms are too different from mine.
> > >
> >
> > You appear to be experiencing a GPU lockup. Unfortunately, there can
> > be many things that cause it, so it really helps to have a good
> > reproducer case. You might try a newer version of mesa or llvm. What
> > does your "reproducing video" work flow use? What apps, APIs are
> > involved?
>
> By "reproducing video" I mean watching videos from either Youtube/Netflix
> with Firefox or mpv for local files. But note that I've experienced at
> least one lock up
> without any video on screen (I don't remember if there was something paused
> on the background).
>
> Using mesa 18.3.4 and, according to glxinfo, DRM 3.30.0 and LLVM 7.0.1,
> with the vulkan, vdpau and va drivers at the same version. I gave up and
> removed any and every instance of the rocm package, since its module
> couldn't compile against the kernel.
>
> Another detail about the lock-up: the screen doesn't freeze, but no
> signal is emitted, so my monitor goes to sleep.
>
> Upgraded to mesa 19.0.1 and llvm 8.0.0, although I didn't see anything
> relevant in the changelogs that could affect my experience.
>
>
> >
> > Alex
> >
> > > I have tried forcing the amdgpu xorg driver with the same results
> > > (I was using radeon).
> > >
> > > --
> > > Braiam
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>
>
> --
> Braiam
--
Braiam