[Bug 110413] GPU crash and failed reset leading to deadlock on Polaris 22 XL [Radeon RX Vega M GL]

bugzilla-daemon at freedesktop.org
Fri Apr 12 14:52:44 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=110413

--- Comment #4 from Rémi Verschelde <rverschelde at gmail.com> ---
Pasting the relevant output from attachment 143951 so that its keywords can be
found by Bugzilla searches.

```
[  325.087186] mce: CPU7: Core temperature above threshold, cpu clock throttled
(total events = 1)
[  325.087187] mce: CPU3: Core temperature above threshold, cpu clock throttled
(total events = 1)
[  325.087188] mce: CPU3: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.087189] mce: CPU7: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.087224] mce: CPU5: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.087225] mce: CPU0: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.087226] mce: CPU1: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.087226] mce: CPU4: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.087227] mce: CPU6: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.087228] mce: CPU2: Package temperature above threshold, cpu clock
throttled (total events = 1)
[  325.089212] mce: CPU7: Core temperature/speed normal
[  325.089213] mce: CPU0: Package temperature/speed normal
[  325.089214] mce: CPU3: Core temperature/speed normal
[  325.089214] mce: CPU4: Package temperature/speed normal
[  325.089215] mce: CPU7: Package temperature/speed normal
[  325.089215] mce: CPU3: Package temperature/speed normal
[  325.089248] mce: CPU6: Package temperature/speed normal
[  325.089248] mce: CPU5: Package temperature/speed normal
[  325.089249] mce: CPU2: Package temperature/speed normal
[  325.089250] mce: CPU1: Package temperature/speed normal
[  565.312183] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0040d508 for
process  pid 0 thread  pid 0
[  565.312194] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
0x00169208
[  565.312200] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
0xFFFFFFFF
[  565.312209] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page
1479176, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  565.312219] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00405508 for
process  pid 0 thread  pid 0
[  565.312224] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
0xFFFFFFFF
[  565.312229] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
0xFFFFFFFF
[  565.312236] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page
4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  565.312244] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00485508 for
process  pid 0 thread  pid 0
[  565.312248] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
0xFFFFFFFF
[  565.312252] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
0xFFFFFFFF
[  565.312258] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page
4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)

<snip>

[  565.312378] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00785508 for
process  pid 0 thread  pid 0
[  565.312383] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR  
0xFFFFFFFF
[  565.312387] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS
0xFFFFFFFF
[  565.312393] amdgpu 0000:01:00.0: VM fault (0xff, vmid 15, pasid 0) at page
4294967295, write from '\xff\xff\xff\xff' (0xffffffff) (511)
[  575.625913] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
signaled seq=117668, emitted seq=117670
[  575.625950] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162
[  575.625953] amdgpu 0000:01:00.0: GPU reset begin!
[  575.626419] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.626420] amdgpu: [powerplay] 
                failed to send message 281 ret is 65535 
[  575.636259] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
of IP block <vce_v3_0> failed -110
[  575.651311] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651312] amdgpu: [powerplay] 
                failed to send message 133 ret is 65535 
[  575.651316] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651316] amdgpu: [powerplay] 
                failed to send message 310 ret is 65535 
[  575.651317] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651317] amdgpu: [powerplay] 
                failed to send message 5e ret is 65535 

<snip>

[  575.651340] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  575.651341] amdgpu: [powerplay] 
                failed to send message 84 ret is 65535 
[  575.651341] amdgpu: [powerplay] Failed to force to switch arbf0!
[  575.651342] amdgpu: [powerplay] [disable_dpm_tasks] Failed to disable DPM!
[  575.651360] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
of IP block <powerplay> failed -22
[  575.769673] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]]
*ERROR* ring kiq_2.1.0 test failed (-110)
[  575.769740] [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[  575.888355] cp is busy, skip halt cp
[  576.007183] rlc is busy, skip halt rlc
[  576.008188] amdgpu 0000:01:00.0: GPU pci config reset
[  576.126260] [drm:amdgpu_device_gpu_recover [amdgpu]] *ERROR* ASIC reset
failed with err r, -22 for drm dev, 0000:01:00.0
[  576.127736] Asynchronous wait on fence drm_sched:gfx:1ca87 timed out
(hint:submit_notify+0x0/0x58 [i915])
[  576.127768] Asynchronous wait on fence drm_sched:gfx:1ca82 timed out
(hint:submit_notify+0x0/0x58 [i915])
[  576.127788] Asynchronous wait on fence i915:Xorg[3673]/0:6455 timed out
(hint:intel_atomic_commit_ready+0x0/0x4c [i915])
[  581.126683] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for
more than 5secs aborting
[  581.126734] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios
stuck executing D654 (len 62, WS 0, PS 0) @ 0xD670
[  581.126754] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios
stuck executing C410 (len 114, WS 0, PS 8) @ 0xC42B
[  581.126755] [drm] asic atom init failed!
[  581.126765] amdgpu 0000:01:00.0: GPU reset(2) failed
[  581.126766] amdgpu 0000:01:00.0: GPU reset end with ret = -22
[  581.126777] [drm] Skip scheduling IBs!
[  581.126782] [drm] Skip scheduling IBs!
[  581.126784] [drm] Skip scheduling IBs!
[  581.126785] [drm] Skip scheduling IBs!
[  581.126786] [drm] Skip scheduling IBs!
[  581.126787] [drm] Skip scheduling IBs!
[  581.126789] [drm] Skip scheduling IBs!
[  581.126790] [drm] Skip scheduling IBs!
[  581.126791] [drm] Skip scheduling IBs!
[  591.487678] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
signaled seq=117670, emitted seq=117670
[  591.487716] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process starcrawlers.x8 pid 9151 thread starcrawle:cs0 pid 9162
[  591.487719] amdgpu 0000:01:00.0: GPU reset begin!
[  591.488418] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  591.488419] amdgpu: [powerplay] 
                failed to send message 281 ret is 65535 
[  591.488495] WARNING: CPU: 2 PID: 666 at
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:788
dm_suspend+0x4e/0x60 [amdgpu]
[  591.488496] Modules linked in: cmac rfcomm ccm msr ip6t_REJECT
nf_reject_ipv6 xt_comment ip6table_mangle ip6table_nat nf_nat_ipv6 ip6table_raw
nf_log_ipv6 ip6table_filter ip6_tables xt_recent ipt_IFWLOG ipt_psd xt_set
ip_set_hash_ip ip_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_hashlimit
xt_addrtype xt_mark iptable_mangle iptable_nat nf_nat_ipv4 xt_CT xt_tcpudp
iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG
nf_conntrack_sane nf_conntrack_netlink nfnetlink nf_nat_tftp nf_nat_snmp_basic
nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_ftp
nf_nat_amanda nf_nat nf_conntrack_tftp nf_conntrack_sip nf_conntrack_pptp
nf_conntrack_proto_gre nf_conntrack_netbios_ns nf_conntrack_broadcast
nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp ts_kmp nf_conntrack_amanda
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter af_packet bnep
binfmt_misc fuse nls_iso8859_1 nls_cp437 vfat fat dm_mirror dm_region_hash
dm_log dm_mod snd_hda_codec_hdmi arc4 joydev
[  591.488509]  intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm
hid_sensor_incl_3d hid_sensor_gyro_3d hid_sensor_magn_3d hid_sensor_rotation
hid_sensor_accel_3d hid_sensor_trigger industrialio_triggered_buffer kfifo_buf
hid_sensor_iio_common industrialio irqbypass hid_multitouch crc32_pclmul
crc32c_intel ghash_clmulni_intel spi_pxa2xx_platform 8250_dw iwlmvm
hid_sensor_hub aesni_intel iTCO_wdt iTCO_vendor_support mac80211
snd_hda_codec_realtek hid_generic aes_x86_64 input_leds tpm_crb crypto_simd
cryptd snd_hda_codec_generic glue_helper ledtrig_audio intel_cstate psmouse
intel_uncore iwlwifi snd_hda_intel thermal snd_hda_codec uvcvideo btusb
snd_hda_core btbcm videobuf2_vmalloc btrtl videobuf2_memops videobuf2_v4l2
btintel videobuf2_common cfg80211 snd_hwdep videodev snd_pcm bluetooth media
snd_timer intel_rapl_perf pinctrl_sunrisepoint ucsi_acpi typec_ucsi usbhid
typec tpm_tis pinctrl_intel intel_wmi_thunderbolt snd tpm_tis_core hp_wmi
soundcore tpm wmi_bmof idma64 ecdh_generic
[  591.488521]  int3400_thermal battery virt_dma button acpi_thermal_rel
rtsx_pci_ms intel_vbtn i2c_i801 acpi_pad hp_wireless ac rfkill sparse_keymap
int3403_thermal memstick mei_me mei intel_lpss_pci intel_pch_thermal intel_lpss
processor_thermal_device intel_ishtp_hid int340x_thermal_zone
intel_soc_dts_iosf evdev nvram sch_fq_codel efivarfs ip_tables x_tables ipv6
crc_ccitt autofs4 amdgpu xhci_pci rtsx_pci_sdmmc xhci_hcd mmc_block mmc_core
usbcore serio_raw chash amd_iommu_v2 rtsx_pci gpu_sched intel_ish_ipc ttm
intel_ishtp usb_common i915 i2c_hid hid i2c_algo_bit drm_kms_helper wmi video
drm
[  591.488549] CPU: 2 PID: 666 Comm: kworker/2:2 Not tainted
5.0.7-desktop-4.mga7 #1
[  591.488550] Hardware name: HP HP Spectre x360 Convertible 15-ch0xx/83BB,
BIOS F.24 11/06/2018
[  591.488552] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  591.488627] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]
[  591.488627] Code: 00 48 89 83 70 cb 00 00 e8 af fc ff ff 48 89 df e8 67 75
00 00 48 8b bb 60 b3 00 00 be 08 00 00 00 e8 16 8f 0a 00 31 c0 5b c3 <0f> 0b eb
c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00
[  591.488628] RSP: 0018:ffffb50201f97d20 EFLAGS: 00010282
[  591.488629] RAX: ffffffffc08a3e00 RBX: ffff93f4a35c0000 RCX:
0000000000000012
[  591.488629] RDX: 0000000000000080 RSI: 0000000000000001 RDI:
ffff93f4a35c0000
[  591.488629] RBP: ffff93f4a35ccb98 R08: 0000000000000492 R09:
0000000000000004
[  591.488630] R10: 0000000000000000 R11: 0000000000000001 R12:
ffff93f4a35c0000
[  591.488630] R13: ffffffffc09e25a0 R14: 0000000000000000 R15:
ffff93f4a35c3498
[  591.488631] FS:  0000000000000000(0000) GS:ffff93f4b1c80000(0000)
knlGS:0000000000000000
[  591.488631] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  591.488632] CR2: 00007f8c18a40a38 CR3: 000000033220e002 CR4:
00000000003606e0
[  591.488632] Call Trace:
[  591.488676]  amdgpu_device_ip_suspend_phase1+0x94/0xc0 [amdgpu]
[  591.488721]  amdgpu_device_ip_suspend+0x1b/0x60 [amdgpu]
[  591.488796]  amdgpu_device_pre_asic_reset+0x9e/0x260 [amdgpu]
[  591.488817]  amdgpu_device_gpu_recover+0x87/0x7e0 [amdgpu]
[  591.488828]  ? drm_err+0x72/0x90 [drm]
[  591.488882]  amdgpu_job_timedout+0xfc/0x120 [amdgpu]
[  591.488884]  drm_sched_job_timedout+0x39/0x60 [gpu_sched]
[  591.488887]  process_one_work+0x200/0x400
[  591.488888]  worker_thread+0x2d/0x3d0
[  591.488889]  ? process_one_work+0x400/0x400
[  591.488891]  kthread+0x112/0x130
[  591.488892]  ? kthread_create_on_node+0x60/0x60
[  591.488894]  ret_from_fork+0x35/0x40
[  591.488895] ---[ end trace 356c1ae357df635c ]---
[  591.499325] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend
of IP block <vce_v3_0> failed -110
```
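
Since the paste above is hard-wrapped by the mailer, a grep for a full message
can miss lines that were split in two. Below is a minimal sketch, not part of
the original report, of how the wrapped lines could be rejoined and filtered
by keyword. The names `rejoin_wrapped`, `filter_records` and `SAMPLE` are
hypothetical, and the sketch assumes that every real log line starts with a
`[ seconds.micros]` timestamp and that anything else is a continuation.

```python
import re

# Matches the dmesg timestamp prefix, e.g. "[  575.625953] ".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def rejoin_wrapped(lines):
    """Merge mail-wrapped continuation lines into one record per timestamp."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = TIMESTAMP.match(line)
        if match:
            # A new log record: keep the timestamp and the message text.
            records.append([float(match.group(1)), match.group(2).strip()])
        elif records and line.strip() and line.strip() != "<snip>":
            # Assumption: a non-timestamped line continues the previous record.
            records[-1][1] += " " + line.strip()
    return [(ts, msg) for ts, msg in records]


def filter_records(records, keywords):
    """Keep only the records whose message contains any of the keywords."""
    return [(ts, msg) for ts, msg in records if any(k in msg for k in keywords)]


# A few lines copied from the paste above, wrapped exactly as in the mail.
SAMPLE = [
    "[  575.625913] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,",
    "signaled seq=117668, emitted seq=117670",
    "[  575.625953] amdgpu 0000:01:00.0: GPU reset begin!",
    "[  575.888355] cp is busy, skip halt cp",
]

if __name__ == "__main__":
    for ts, msg in filter_records(rejoin_wrapped(SAMPLE), ["*ERROR*", "GPU reset"]):
        print(f"{ts:.6f} {msg}")
```

Joining with a single space is a guess about where the mailer broke each line,
so the rejoined text may differ from the raw dmesg output in whitespace.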
