[Bug 110413] GPU crash and failed reset leading to deadlock on Polaris 22 XL [Radeon RX Vega M GL]

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Jul 28 15:41:10 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=110413

--- Comment #10 from Alex Behling <alexander.behling at googlemail.com> ---
From my experience this seems to be a thermal problem. I have the exact same
hardware configuration, running the latest Arch Linux kernel.

$ uname -a
Linux lexnote 5.2.3-arch1-1-ARCH #1 SMP PREEMPT Fri Jul 26 08:13:47 UTC 2019
x86_64 GNU/Linux

If I leave the system with the default PM settings (whether the profile is
Performance or Balance doesn't matter), sooner or later I get lockups in any
game or application that puts a higher load on the GPU.


EXAMPLE DMESG OUTPUT:

[Do Jul 25 23:33:45 2019] amdgpu 0000:01:00.0: GPU pci config reset
[Do Jul 25 23:33:53 2019] [drm] PCIE GART of 256M enabled (table at
0x000000F400000000).
[Do Jul 25 23:33:53 2019] amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring gfx test failed (-110)
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR*
resume of IP block <gfx_v8_0> failed -110
[Do Jul 25 23:33:53 2019] [drm:amdgpu_device_resume [amdgpu]] *ERROR*
amdgpu_device_ip_resume failed (-110).
[Do Jul 25 23:33:53 2019] [drm] schedsdma0 is not ready, skipping
[Do Jul 25 23:33:53 2019] [drm] schedsdma1 is not ready, skipping
[Do Jul 25 23:33:59 2019] WARNING: CPU: 1 PID: 20969 at
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:891
dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Modules linked in: msr fuse 8021q garp mrp stp llc
ccm snd_hda_codec_hdmi hid_sensor_gyro_3d hid_sensor_accel_3d
hid_sensor_magn_3d hid_sensor_rotation hid_sensor_incl_3d hid_sensor_trigger
industrialio_triggered_buffer kfifo_buf hid_sensor_iio_common industrialio
hid_sensor_hub intel_ishtp_loader intel_ishtp_hid arc4 iwlmvm mousedev
cdc_ether usbnet r8152 xpad ff_memless joydev mii mac80211 uvcvideo btusb
videobuf2_vmalloc hid_logitech_hidpp videobuf2_memops btrtl btbcm nls_iso8859_1
videobuf2_v4l2 btintel nls_cp437 videobuf2_common bluetooth vfat fat videodev
media spi_pxa2xx_platform ecdh_generic iTCO_wdt 8250_dw hid_multitouch ecc
mei_hdcp iTCO_vendor_support iwlwifi intel_rapl hp_wmi x86_pkg_temp_thermal
wmi_bmof intel_powerclamp intel_wmi_thunderbolt coretemp kvm_intel
snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio kvm psmouse
input_leds snd_hda_intel cfg80211 irqbypass intel_cstate snd_hda_codec
snd_hda_core intel_uncore snd_hwdep intel_rapl_perf snd_pcm
[Do Jul 25 23:33:59 2019]  rtsx_pci_ms memstick snd_timer pcspkr mei_me
intel_ish_ipc processor_thermal_device snd idma64 int3403_thermal ucsi_acpi
i2c_i801 soundcore typec_ucsi rfkill intel_lpss_pci mei tpm_crb
int340x_thermal_zone intel_pch_thermal intel_ishtp intel_soc_dts_iosf
intel_lpss i2c_hid typec wmi tpm_tis tpm_tis_core tpm rng_core intel_vbtn
battery sparse_keymap hp_wireless evdev mac_hid int3400_thermal
acpi_thermal_rel ac pcc_cpufreq vboxnetflt(OE) vboxnetadp(OE) vboxpci(OE)
vboxdrv(OE) sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache
jbd2 algif_skcipher af_alg hid_logitech_dj hid_generic usbhid hid dm_crypt
crct10dif_pclmul crc32_pclmul dm_mod crc32c_intel ghash_clmulni_intel
rtsx_pci_sdmmc serio_raw mmc_core atkbd libps2 ahci libahci aesni_intel libata
aes_x86_64 crypto_simd cryptd xhci_pci glue_helper scsi_mod xhci_hcd rtsx_pci
i8042 serio amdgpu amd_iommu_v2 gpu_sched ttm i915 intel_gtt i2c_algo_bit
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[Do Jul 25 23:33:59 2019]  agpgart
[Do Jul 25 23:33:59 2019] CPU: 1 PID: 20969 Comm: kworker/1:2 Tainted: G       
   OE     5.2.1-arch1-1-ARCH #1
[Do Jul 25 23:33:59 2019] Hardware name: HP HP Spectre x360 Convertible
15-ch0xx/83BB, BIOS F.24 11/06/2018
[Do Jul 25 23:33:59 2019] Workqueue: pm pm_runtime_work
[Do Jul 25 23:33:59 2019] RIP: 0010:dm_suspend+0x4e/0x60 [amdgpu]
[Do Jul 25 23:33:59 2019] Code: 00 48 89 83 70 e9 00 00 e8 9f fc ff ff 48 89 df
e8 97 83 00 00 48 8b bb 70 cf 00 00 be 08 00 00 00 e8 b6 9a 08 00 31 c0 5b c3
<0f> 0b eb c1 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00
[Do Jul 25 23:33:59 2019] RSP: 0018:ffffb286869cfcb8 EFLAGS: 00010282
[Do Jul 25 23:33:59 2019] RAX: ffffffffc0675ed0 RBX: ffffa1b9e0d30000 RCX:
ffffffffc073e980
[Do Jul 25 23:33:59 2019] RDX: 0000000000000080 RSI: 0000000000000001 RDI:
ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] RBP: ffffa1b9e0d3e998 R08: 0000000000000001 R09:
0000000000000018
[Do Jul 25 23:33:59 2019] R10: fefefefefefefeff R11: 0000000000000000 R12:
ffffa1b9e0d30000
[Do Jul 25 23:33:59 2019] R13: 0000000000000000 R14: 0000000000000000 R15:
ffffa1b9ebc8bd80
[Do Jul 25 23:33:59 2019] FS:  0000000000000000(0000) GS:ffffa1b9eea40000(0000)
knlGS:0000000000000000
[Do Jul 25 23:33:59 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Do Jul 25 23:33:59 2019] CR2: 00007f1d9c02b000 CR3: 0000000469058002 CR4:
00000000003606e0
[Do Jul 25 23:33:59 2019] Call Trace:
[Do Jul 25 23:33:59 2019]  amdgpu_device_ip_suspend_phase1+0x8e/0xc0 [amdgpu]
[Do Jul 25 23:33:59 2019]  amdgpu_device_suspend+0x234/0x390 [amdgpu]
[Do Jul 25 23:33:59 2019]  amdgpu_pmops_runtime_suspend+0x41/0xb0 [amdgpu]
[Do Jul 25 23:33:59 2019]  pci_pm_runtime_suspend+0x5b/0x150
[Do Jul 25 23:33:59 2019]  ? __switch_to_asm+0x40/0x70
[Do Jul 25 23:33:59 2019]  vga_switcheroo_runtime_suspend+0x25/0xb0
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  __rpm_callback+0x7b/0x130
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  rpm_callback+0x2a/0x90
[Do Jul 25 23:33:59 2019]  ? vga_switcheroo_runtime_resume+0x60/0x60
[Do Jul 25 23:33:59 2019]  rpm_suspend+0x136/0x610
[Do Jul 25 23:33:59 2019]  pm_runtime_work+0x94/0xa0
[Do Jul 25 23:33:59 2019]  process_one_work+0x1d1/0x3e0
[Do Jul 25 23:33:59 2019]  worker_thread+0x4a/0x3d0
[Do Jul 25 23:33:59 2019]  kthread+0xfd/0x130
[Do Jul 25 23:33:59 2019]  ? process_one_work+0x3e0/0x3e0
[Do Jul 25 23:33:59 2019]  ? kthread_park+0x90/0x90
[Do Jul 25 23:33:59 2019]  ret_from_fork+0x35/0x40
[Do Jul 25 23:33:59 2019] ---[ end trace 69a711ec632dab70 ]---
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable SCLK DPM when
DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Trying to disable voltage DPM
when DPM is disabled
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] Failed to force to switch arbf0!
[Do Jul 25 23:33:59 2019] amdgpu: [powerplay] [disable_dpm_tasks] Failed to
disable DPM!
[Do Jul 25 23:33:59 2019] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]]
*ERROR* suspend of IP block <powerplay> failed -22
[Do Jul 25 23:34:00 2019] cp is busy, skip halt cp
[Do Jul 25 23:34:00 2019] rlc is busy, skip halt rlc
[Do Jul 25 23:34:00 2019] amdgpu 0000:01:00.0: GPU pci config reset
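
For reference, the negative numbers in the *ERROR* lines above are kernel errno
values: on Linux, -110 is ETIMEDOUT and -22 is EINVAL. Below is a minimal shell
sketch for pulling such lines out of dmesg and decoding them. The function
names errno_name and decode_drm_errors are made up for illustration, only the
two codes seen in this log are mapped, and the sed pattern assumes unwrapped
dmesg lines that end in "failed -N" or "failed (-N)".

```sh
# Map a negative kernel return code to its Linux errno name.
# Only the two codes that appear in the log above are listed.
errno_name() {
    case "$1" in
        -110) echo ETIMEDOUT ;;
        -22)  echo EINVAL ;;
        *)    echo UNKNOWN ;;
    esac
}

# Read kernel log text on stdin and print every *ERROR* line,
# followed by the decoded return code when one can be found.
decode_drm_errors() {
    grep -F '*ERROR*' | while IFS= read -r line; do
        # Take the trailing "failed -N", "failed (-N)" or "failed (-N)." code.
        code=$(printf '%s\n' "$line" |
            sed -nE 's/.*failed \(?(-[0-9]+)\)?\.?$/\1/p')
        if [ -n "$code" ]; then
            printf '%s  => %s (%s)\n' "$line" "$code" "$(errno_name "$code")"
        else
            printf '%s\n' "$line"
        fi
    done
}
```

Typical use would be "dmesg | decode_drm_errors"; depending on the system
configuration, reading the kernel log may require root.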

If I cap the maximum CPU frequency at 2.2 GHz, which keeps the temperatures of
the CPU cores and the GPU just below 60 degrees Celsius, the problem no longer
occurs.

$ sudo cpupower frequency-set -u 2.2GHz
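
To check whether the cap actually keeps things under that 60 degree mark, the
hwmon sysfs files can be polled; they report temperatures in millidegrees
Celsius. The following is only a sketch: read_temp_c and check_temps are
hypothetical helper names, and which hwmon index belongs to the CPU or to
amdgpu differs from machine to machine, so the glob in the usage note is an
assumption.

```sh
# Read one hwmon temperature file (millidegrees Celsius)
# and print the value in whole degrees Celsius.
read_temp_c() {
    raw=$(cat "$1") || return 1
    echo $(( raw / 1000 ))
}

# Usage: check_temps LIMIT FILE...
# Prints a line for every sensor at or above LIMIT degrees Celsius and
# returns 1 if any sensor was that hot, 0 otherwise.
check_temps() {
    limit=$1
    shift
    hot=0
    for f in "$@"; do
        # Skip sensors that do not exist or cannot be read.
        [ -r "$f" ] || continue
        t=$(read_temp_c "$f") || continue
        if [ "$t" -ge "$limit" ]; then
            echo "$f: $t C (at or above $limit C)"
            hot=1
        fi
    done
    return "$hot"
}
```

For example, "check_temps 60 /sys/class/hwmon/hwmon*/temp*_input" lists every
sensor currently at 60 degrees or more and prints nothing if all are below.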
