[Bug 105733] Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.

bugzilla-daemon at freedesktop.org
Tue Nov 20 14:15:24 UTC 2018


https://bugs.freedesktop.org/show_bug.cgi?id=105733

--- Comment #47 from Allan <allan4229 at gmail.com> ---
I have really bad news.

It took me a long time to answer because I literally sent in for warranty or
replaced ALL of the components in my PC.

The CPU (R7 1800X), from batch 21, was replaced by AMD itself with a new one
from batch 35.

But OK, let's talk about the amdgpu :

(In reply to Andrey Grodzovsky from comment #25)
> (In reply to Allan from comment #12)
> Can you build latest kernel (4.18) and grab again latest firmware and try
> again ?
> Links to kernel and firmware:
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/ 

For reasons already explained here, I could neither compile nor test it before,
so please don't be mad at me:
- Sold my old PC.
- My notebook was completely filled with files.
- Components on warranty. Testing everything else.

So I managed to borrow a PC to test the video cards. So far I have tested only
the nvidia one, to prove to AMD that the GPU is working and that it is the
pci-controller of the CPU/chipset (a guess of mine) that is broken. I am going
to test the RX480 on this PC as soon as possible. My warranties are expiring
and I had to set priorities.

I already said it here, but with the 1800X I couldn't even clone the git
repository (the checksum always failed; I tried many times).

Then I managed to free some space on my notebook and started to build
yesterday.
- Included amd-ucode firmware.
- Included polaris10 firmware (for RX480).
- Made some optimizations for Ryzen as described on Gentoo's dedicated page.

Compiled version 4.20-rc1, as present in the branch. No errors reported.

There are 2 applications that currently make it easy to reproduce the
problems:
- Metro 2033 Redux through steam.
- Left 4 Dead 2 through steam.

Started Metro 2033. It worked for some minutes with no issue but, for some
reason, without any sound. Closed it. Turned off the HDMI audio in pavucontrol
to use only the default output. Restarted steam.

Started Left 4 Dead 2 this time. Was able to change the graphics settings to
max without AA and vsync. Played for 15 seconds and got a screen freeze. Waited
for a script to properly record the logs and temps. Hard rebooted. This time
even my BIOS/EFI screen had a green background, but it was still operational.
Everything was green except the text. Rebooted again and got back to normal
colors.

And here are the logs :

kern.log about Firefox usage :
> Nov 14 05:26:50 desk kernel: [  324.714998] Chrome_~dThread[1788]: segfault at 0 ip 00007fbfee5e3181 sp 00007fbfec2d1ad0 error 6 in libxul.so[7fbfee5cf000+3a2c000]

This suggests that the CPU still either has problematic microcode or is
defective.

dmesg about amdgpu screen freeze :
> [ 3323.920795] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000080c for process hl2_linux pid 14648 thread amdgpu_cs:0 pid 14653
> [ 3323.920799] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
> [ 3323.920801] amdgpu 0000:09:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0200800C
> [ 3323.920804] amdgpu 0000:09:00.0: VM fault (0x0c, vmid 1, pasid 32774) at page 0, read from 'TC0' (0x54433000) (8)
> [ 3334.103233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=274140, emitted seq=274142
> [ 3334.103239] amdgpu 0000:09:00.0: GPU reset begin!
> [ 3344.332607] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:46:crtc-0] hw_done or flip_done timed out
> [ 3504.834097] INFO: task kworker/u32:2:3872 blocked for more than 120 seconds.
> [ 3504.834103]       Not tainted 4.20.0-rc1-amd #2
> [ 3504.834105] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3504.834107] kworker/u32:2   D    0  3872      2 0x80000000
> [ 3504.834123] Workqueue: events_unbound commit_work [drm_kms_helper]
> [ 3504.834126] Call Trace:
> [ 3504.834133]  ? __schedule+0x2a0/0x880
> [ 3504.834136]  schedule+0x28/0x80
> [ 3504.834139]  schedule_timeout+0x25d/0x380
> [ 3504.834217]  ? dce110_timing_generator_get_position+0x5b/0x70 [amdgpu]
> [ 3504.834292]  ? dce110_timing_generator_get_crtc_scanoutpos+0x70/0xb0 [amdgpu]
> [ 3504.834297]  dma_fence_default_wait+0x23b/0x2a0
> [ 3504.834301]  ? dma_fence_release+0x90/0x90
> [ 3504.834304]  dma_fence_wait_timeout+0xdd/0x100
> [ 3504.834308]  reservation_object_wait_timeout_rcu+0x161/0x270
> [ 3504.834387]  amdgpu_dm_do_flip+0x112/0x370 [amdgpu]
> [ 3504.834468]  amdgpu_dm_atomic_commit_tail+0x68b/0xcd0 [amdgpu]
> [ 3504.834472]  ? __switch_to_asm+0x40/0x70
> [ 3504.834475]  ? wait_for_completion_timeout+0x3b/0x1a0
> [ 3504.834477]  ? __switch_to_asm+0x34/0x70
> [ 3504.834480]  ? __switch_to_asm+0x40/0x70
> [ 3504.834483]  ? __switch_to+0x1ba/0x450
> [ 3504.834492]  commit_tail+0x3d/0x70 [drm_kms_helper]
> [ 3504.834497]  process_one_work+0x1aa/0x3a0
> [ 3504.834500]  worker_thread+0x30/0x3a0
> [ 3504.834503]  ? drain_workqueue+0x130/0x130
> [ 3504.834506]  kthread+0x11d/0x140
> [ 3504.834509]  ? kthread_park+0x80/0x80
> [ 3504.834512]  ret_from_fork+0x22/0x40
> [ 3516.645267] WARNING: CPU: 14 PID: 14694 at kernel/kthread.c:501 kthread_park+0x6c/0x80
> [ 3516.645271] Modules linked in: fuse edac_mce_amd kvm_amd nls_ascii nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_hda_codec joydev amdgpu snd_hda_core snd_hwdep chash gpu_sched snd_pcm snd_timer ttm drm_kms_helper snd drm i2c_algo_bit sp5100_tco soundcore kvm efi_pstore efivars sg irqbypass evdev wmi_bmof serio_raw pcspkr k10temp ccp tpm_crb pcc_cpufreq tpm_tis tpm_tis_core tpm rng_core acpi_cpufreq button parport_pc ppdev lp parport efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto btrfs xor zstd_decompress zstd_compress xxhash raid6_pq libcrc32c crc32c_generic algif_skcipher af_alg dm_crypt dm_mod sd_mod hid_generic usbhid hid uas usb_storage crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ahci xhci_pci aes_x86_64 libahci crypto_simd xhci_hcd cryptd glue_helper libata r8169 i2c_piix4 libphy usbcore scsi_mod thermal wmi gpio_amdpt gpio_generic
> [ 3516.645324] CPU: 14 PID: 14694 Comm: TaskSchedulerFo Not tainted 4.20.0-rc1-amd #2
> [ 3516.645327] Hardware name: BIOSTAR Group X370GT7/X370GT7, BIOS 5.13 08/07/2018
> [ 3516.645330] RIP: 0010:kthread_park+0x6c/0x80
> [ 3516.645333] Code: 18 e8 88 6c 67 00 be 40 00 00 00 48 89 df e8 8b c3 00 00 48 85 c0 74 1b 31 c0 5b 5d c3 0f 0b eb ae 0f 0b b8 da ff ff ff eb f0 <0f> 0b b8 f0 ff ff ff eb e7 0f 0b eb e3 0f 1f 80 00 00 00 00 0f 1f
> [ 3516.645335] RSP: 0018:ffffbafdc3fcfb60 EFLAGS: 00010202
> [ 3516.645338] RAX: 0000000000000004 RBX: ffff9dcd93f140c0 RCX: dead000000000200
> [ 3516.645339] RDX: ffff9dcd92ba7430 RSI: ffff9dcd93f140c0 RDI: ffff9dcd8a9049c0
> [ 3516.645341] RBP: ffff9dcd940a5360 R08: ffff9dcd96da25a8 R09: 0000000000000000
> [ 3516.645343] R10: 0000000000000000 R11: 000000000000019c R12: ffff9dcd92ba27a0
> [ 3516.645344] R13: ffff9dcd76d34200 R14: 0000000000000206 R15: dead000000000100
> [ 3516.645347] FS:  00007efea483e700(0000) GS:ffff9dcd96d80000(0000) knlGS:0000000000000000
> [ 3516.645349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3516.645351] CR2: 00005654fe725e10 CR3: 0000000200d40000 CR4: 00000000003406e0
> [ 3516.645352] Call Trace:
> [ 3516.645362]  drm_sched_entity_fini+0x37/0x190 [gpu_sched]
> [ 3516.645423]  amdgpu_vm_fini+0xad/0x530 [amdgpu]
> [ 3516.645429]  ? idr_destroy+0x78/0xc0
> [ 3516.645481]  amdgpu_driver_postclose_kms+0x151/0x270 [amdgpu]
> [ 3516.645496]  drm_file_free.part.5+0x21f/0x300 [drm]
> [ 3516.645510]  drm_release+0xaa/0x120 [drm]
> [ 3516.645514]  __fput+0xac/0x1e0
> [ 3516.645518]  task_work_run+0x8f/0xb0
> [ 3516.645522]  do_exit+0x2e6/0xb30
> [ 3516.645525]  do_group_exit+0x3a/0xb0
> [ 3516.645528]  get_signal+0x27a/0x5f0
> [ 3516.645532]  do_signal+0x30/0x6d0
> [ 3516.645537]  exit_to_usermode_loop+0x89/0xf0
> [ 3516.645540]  do_syscall_64+0xda/0xe0
> [ 3516.645544]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 3516.645547] RIP: 0033:0x7efeb6b9d19a
> [ 3516.645553] Code: Bad RIP value.
> [ 3516.645555] RSP: 002b:00007efea483d810 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> [ 3516.645557] RAX: fffffffffffffdfc RBX: 00007efea483d958 RCX: 00007efeb6b9d19a
> [ 3516.645559] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007efea483d980
> [ 3516.645560] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffe661d7080
> [ 3516.645562] R10: 00007efea483d860 R11: 0000000000000246 R12: 0000000000000000
> [ 3516.645564] R13: 00007efea483d980 R14: 00007efea483d990 R15: 00007efea483d930
> [ 3516.645566] ---[ end trace 7da35ac4aa65c90d ]---
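
For anyone reading the fault lines above, the status register can be unpacked
by hand. This is a minimal sketch, not the driver's code: the bit positions are
an assumption, chosen because they reproduce the values the driver itself
printed in the "VM fault" line (0x0c, vmid 1, read, client id 8), and the
function names are hypothetical. The 0x54433000 value is simply the ASCII tag
'TC0' packed into 32 bits.

```python
# Sketch only: the field layout below is an ASSUMPTION that reproduces the
# values printed in the dmesg excerpt; check the register headers of the
# kernel you are running before relying on it.

def decode_fault_status(status):
    """Split VM_CONTEXT1_PROTECTION_FAULT_STATUS into its (assumed) fields."""
    return {
        "protections": status & 0xFF,                     # bits 0-7
        "client_id": (status >> 12) & 0x1FF,              # bits 12-20
        "rw": "write" if (status >> 24) & 1 else "read",  # bit 24
        "vmid": (status >> 25) & 0xF,                     # bits 25-28
    }


def decode_client_tag(tag):
    """Turn the packed ASCII client tag (e.g. 0x54433000) into text."""
    raw = tag.to_bytes(4, "big")
    return raw.rstrip(b"\x00").decode("ascii")


# Values copied from the log above.
print(decode_fault_status(0x0200800C))
print(decode_client_tag(0x54433000))
```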

It is important to note that the most common code that appears while using
generic kernels is 147, rather than the 146 shown here.
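
To check which code dominates in a given log, the "GPU fault detected" lines
can be tallied. A minimal sketch, assuming the message format is the same as in
the excerpt above; the function name is hypothetical and the sample line is
copied from this report.

```python
import re
from collections import Counter

# Matches the "GPU fault detected: <code>" line as it appears in the excerpt.
FAULT_RE = re.compile(r"amdgpu [0-9a-f:.]+: GPU fault detected: (\d+)")


def count_fault_codes(log_text):
    """Return how many times each fault code appears in the given log text."""
    return Counter(int(m.group(1)) for m in FAULT_RE.finditer(log_text))


sample = (
    "[ 3323.920795] amdgpu 0000:09:00.0: GPU fault detected: 146 0x0000080c "
    "for process hl2_linux pid 14648 thread amdgpu_cs:0 pid 14653\n"
)
print(count_fault_codes(sample))
```

Passing it the full text of kern.log or the output of dmesg gives one count
per code, so logs from different kernels can be compared directly.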

Xorg.0.log reports nothing.

I said that this was bad news because it seems to me that both the CPU and the
amdgpu driver are defective.

I noticed that while running kernel 4.18 the GPU is kept at 100% (mclk and
sclk) all the time, while with this new kernel the GPU tries to scale its
performance.
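
The clock behaviour can be compared between kernels by reading the amdgpu
sysfs files pp_dpm_sclk and pp_dpm_mclk, where the active level is marked with
an asterisk. A minimal sketch: the card0 path is an assumption (the index may
differ on other systems), the helper names are hypothetical, and the sample
text is illustrative rather than captured from this system.

```python
import re
from pathlib import Path

# ASSUMPTION: the GPU is card0; adjust the index for your system.
SYSFS_DIR = Path("/sys/class/drm/card0/device")

# One level per line, e.g. "1: 608Mhz *" (the asterisk marks the active level).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*mhz\s*(\*?)\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Parse pp_dpm_* text into (clocks in MHz, index of the active level)."""
    clocks = []
    active = None
    for line in text.splitlines():
        m = LEVEL_RE.match(line)
        if not m:
            continue  # skip blank or unexpected lines
        clocks.append(int(m.group(2)))
        if m.group(3):
            active = int(m.group(1))
    return clocks, active


def read_levels(name):
    """Read and parse one sysfs file, e.g. 'pp_dpm_sclk' or 'pp_dpm_mclk'."""
    return parse_dpm_levels((SYSFS_DIR / name).read_text())


# Illustrative input only.
sample = "0: 300Mhz\n1: 608Mhz *\n2: 910Mhz\n"
print(parse_dpm_levels(sample))
```

Calling read_levels("pp_dpm_sclk") periodically while a game runs would show
whether the card stays pinned at the highest level or actually scales.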

Also, it is important to note that the nvidia GTX 1070 throws a lot of Xid
error codes (see
https://devtalk.nvidia.com/default/topic/1043483/linux/xid-errors-on-gtx-1070-linux/post/5293440
). This is why I think that the 1800X has a defective pci-controller, and it is
the second part of the "really bad news". Maybe it is happening mostly with
Ryzen processors? I'll test the RX480 in the other computer ASAP; first I need
to send information about the CPU to AMD to proceed with the warranty process.

The GTX 1070 works without a single problem outside of this PC. The other cards
that I tested before follow the same pattern (2 RX480, 1 RX 580, 1 GTX 970,
1 GTX 1070).

Currently I have only 1 RX480 and 1 GTX 1070. Now that I know that the cards
don't have any problem, I'm selling them, and soon I'll have only one or none.
The seller told me off for requesting warranty for the RX 480 when I thought it
was defective; he sent me a different one, and according to him the one that I
sent back was working without any issues.

I'm already at a new stage of re-sending the CPU to AMD, and praying for an end
to my endless torment. I think that they'll have to refund me (and then I'll
take a loss on the motherboard).

Please tell me about any other steps that you want me to take.

I can also provide a full description of the kernel compilation (parameters)
and even provide a link to the generated .deb packages.
