[Bug 105760] [4.17-rc1] RIP: smu7_populate_single_firmware_entry.isra.6+0x57/0xc0 [amdgpu] RSP: ffffa17901efb930

Thu Jul 12 19:38:06 UTC 2018

https://bugs.freedesktop.org/show_bug.cgi?id=105760

Thomas Martitz <kugel at rockbox.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #140592|0                           |1
        is obsolete|                            |

--- Comment #42 from Thomas Martitz <kugel at rockbox.org> ---
Created attachment 140611
  --> https://bugs.freedesktop.org/attachment.cgi?id=140611&action=edit
dmesg with 0001-workaround-v2.patch +
0001-drm-amdgpu-add-ATPX-quirk-for-a-polaris-12-laptop.patch

Sorry to say, but this patch actually makes things *worse*.

First, by accident, I applied your latest patch on top of my previous workaround
v2. This gives working suspend/resume but many more error messages in dmesg; in
particular, a WARN() triggers:

[  385.996911] Modules linked in: cmac rfcomm ccm arc4 snd_hda_codec_hdmi
snd_hda_codec_conexant snd_hda_codec_generic joydev intel_rapl mousedev
x86_pkg_temp_thermal intel_powerclamp bnep coretemp iwlmvm snd_soc_skl
snd_soc_skl_ipc hid_multitouch snd_soc_sst_ipc mac80211 snd_soc_sst_dsp
hid_generic kvm snd_hda_ext_core mei_wdt snd_soc_core nls_iso8859_1 i915
nls_cp437 iwlwifi btusb irqbypass btrtl vfat btbcm crct10dif_pclmul
snd_compress btintel iTCO_wdt crc32_pclmul iTCO_vendor_support snd_soc_acpi
ghash_clmulni_intel fat bluetooth pcbc intel_wmi_thunderbolt hp_wmi
sparse_keymap snd_hda_intel wmi_bmof snd_hda_codec cfg80211 crc16 snd_hwdep
ecdh_generic aesni_intel snd_hda_core aes_x86_64 crypto_simd snd_pcm cryptd
e1000e snd_timer glue_helper intel_cstate intel_uncore intel_rapl_perf uvcvideo
idma64
[  385.996938]  tpm_crb input_leds led_class videobuf2_vmalloc snd psmouse
videobuf2_memops mei_me i2c_i801 ptp mei videobuf2_v4l2 pps_core
processor_thermal_device ucsi_acpi i2c_hid videobuf2_common typec_ucsi
intel_lpss_pci soundcore typec rfkill intel_pch_thermal wmi intel_gtt
intel_lpss intel_soc_dts_iosf hid videodev tpm_tis tpm_tis_core int3403_thermal
int340x_thermal_zone rtc_cmos media evdev tpm ac int3400_thermal mac_hid
battery acpi_thermal_rel rng_core hp_wireless sg scsi_mod crypto_user ip_tables
x_tables btrfs libcrc32c crc32c_generic xor zstd_decompress zstd_compress
xxhash serio_raw raid6_pq atkbd libps2 xhci_pci xhci_hcd crc32c_intel usbcore
usb_common i8042 serio
[  385.996964] CPU: 4 PID: 215 Comm: kworker/4:2 Tainted: G     U  W        
4.18.0-rc3-custom+ #73
[  385.996965] Hardware name: HP HP ZBook 14u G5/83B2, BIOS Q78 Ver. 01.00.05
01/25/2018
[  385.996968] Workqueue: pm pm_runtime_work
[  385.996970] RIP: 0010:generic_reg_wait+0xe7/0x160
[  385.996970] Code: 44 24 58 8b 54 24 48 89 de 44 89 4c 24 08 48 8b 4c 24 50
48 c7 c7 f8 29 19 bd e8 c4 24 e5 ff 83 7d 20 01 44 8b 4c 24 08 74 02 <0f> 0b 48
83 c4 10 44 89 c8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 0f 
[  385.996989] RSP: 0018:ffffb45681fefbd8 EFLAGS: 00010297
[  385.996990] RAX: 000000000000006b RBX: 000000000000000a RCX:
0000000000000001
[  385.996991] RDX: 0000000080000001 RSI: ffffffffbd1151a6 RDI:
00000000ffffffff
[  385.996991] RBP: ffff97f4e37e3240 R08: ffffffffbc499790 R09:
00000000ffffffff
[  385.996992] R10: 0000000000000004 R11: ffffffffbdab8f2d R12:
0000000000000bb9
[  385.996992] R13: 0000000000004ea4 R14: 0000000000010000 R15:
0000000000000000
[  385.996993] FS:  0000000000000000(0000) GS:ffff97f4ef500000(0000)
knlGS:0000000000000000
[  385.996994] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  385.996994] CR2: 00007ff68c6ef000 CR3: 00000001ee40a002 CR4:
00000000003606e0
[  385.996995] Call Trace:
[  385.997000]  dce110_stream_encoder_dp_blank+0x11c/0x180
[  385.997002]  power_down_all_hw_blocks+0x3d/0x1c0
[  385.997003]  dce110_power_down+0xe/0x20
[  385.997005]  dc_set_power_state+0x1b/0x70
[  385.997007]  dm_suspend+0x4a/0x60
[  385.997009]  amdgpu_device_ip_suspend+0xe4/0x170
[  385.997011]  amdgpu_device_suspend+0x251/0x3a0
[  385.997013]  amdgpu_pmops_runtime_suspend+0x44/0xb0
[  385.997015]  pci_pm_runtime_suspend+0x64/0x180
[  385.997017]  ? vga_switcheroo_runtime_resume+0x60/0x60
[  385.997019]  vga_switcheroo_runtime_suspend+0x24/0xb0
[  385.997020]  __rpm_callback+0x75/0x1b0
[  385.997022]  ? __switch_to_asm+0x30/0x60
[  385.997024]  ? vga_switcheroo_runtime_resume+0x60/0x60
[  385.997025]  rpm_callback+0x1f/0x70
[  385.997026]  ? vga_switcheroo_runtime_resume+0x60/0x60
[  385.997028]  rpm_suspend+0x12a/0x610
[  385.997030]  ? finish_task_switch+0x83/0x2e0
[  385.997031]  ? __switch_to_asm+0x24/0x60
[  385.997032]  pm_runtime_work+0x7d/0xa0
[  385.997034]  process_one_work+0x1eb/0x3c0
[  385.997035]  worker_thread+0x2d/0x3d0
[  385.997037]  ? process_one_work+0x3c0/0x3c0
[  385.997038]  kthread+0x112/0x130
[  385.997039]  ? kthread_flush_work_fn+0x10/0x10
[  385.997041]  ret_from_fork+0x35/0x40
[  385.997043] ---[ end trace 04724a7f4f9fccf6 ]---

Then, there are new fatal error messages like this (the last line is new with
your patch):

[  436.030371] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.030394] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  436.030410] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.030433] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  436.030448] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.030471] amdgpu: [powerplay] 
                last message was failed ret is 65535
[  436.030487] amdgpu: [powerplay] 
                failed to send message 261 ret is 65535 
[  436.145782] amdgpu 0000:01:00.0: GPU pci config reset
[  437.106049] [drm:amdgpu_device_suspend] *ERROR* amdgpu asic reset failed

I'm also quite sure I haven't seen the following before:
[  370.888835] [drm:gfx_v8_0_ring_test_ring] *ERROR* amdgpu: ring 0 test failed
(scratch(0xC040)=0xFFFFFFFF)
[  370.888839] [drm:amdgpu_device_ip_resume_phase2] *ERROR* resume of IP block
<gfx_v8_0> failed -22
[  370.888841] [drm:amdgpu_device_resume] *ERROR* amdgpu_device_ip_resume
failed (-22).

Most importantly, my observation that reading toc->num_entries returns -1 is
still occurring:

[  368.991914] amdgpu: [powerplay] smu7_request_smu_load_fw: 10
ffffb456a0081000 0 1
[  368.991927] amdgpu: [powerplay] smu7_request_smu_load_fw: 20
ffffb456a0081000 -1 -1
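
As an aside, the values in the excerpts above (ret 65535, scratch 0xFFFFFFFF,
num_entries -1) are all "every bit set" at some width. A minimal sketch to
check that follows. The widths (16 and 32 bits) and the helper name
is_all_ones are my assumptions for illustration, not taken from the driver,
and this is only an observation about the numbers, not a diagnosis.

```python
def is_all_ones(value: int, bits: int) -> bool:
    """Return True if value, truncated to `bits` bits, has every bit set."""
    mask = (1 << bits) - 1
    return (value & mask) == mask


# Values copied from the log excerpts in this comment.
# The bit widths are assumed for illustration only.
observations = {
    "powerplay ret": (65535, 16),
    "gfx ring 0 scratch": (0xFFFFFFFF, 32),
    "toc->num_entries": (-1, 32),
}

for name, (value, bits) in observations.items():
    print(f"{name}: all ones at {bits} bits = {is_all_ones(value, bits)}")
```

All three come out as all-ones under those assumed widths, whereas the first
smu7_request_smu_load_fw line (0 and 1) does not.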

Then, after I found that my workaround was still applied, I tried without it.
Unfortunately, with just your patch I can't get past the SDDM login screen.
The laptop freezes once the KDE session loads (I'm assuming starting X causes
the freeze).
