[Bug 109955] amdgpu [RX Vega 64] system freeze while gaming

Sat Sep 7 03:48:56 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=109955

--- Comment #96 from Rodney A Morris <rodamorris at gmail.com> ---
(In reply to Mauro Gaspari from comment #90)

I am experiencing periodic lockups with various games, including Hearts of Iron
IV, BATTLETECH, and Stellaris all being played through Steam.  Below is the
most recent crash from playing less than 5 minutes of Hearts of Iron IV.

> 
> OS Info can be taken from neofetch:
> System info:

           /:-------------:\          
       :-------------------::        -------------------------------- 
     :-----------/shhOHbmp---:\      OS: Fedora release 30 (Thirty) x86_64 
   /-----------omMMMNNNMMD  ---:     Kernel: 5.2.11-200.fc30.x86_64+debug 
  :-----------sMMMMNMNMP.    ---:    Uptime: 11 mins 
 :-----------:MMMdP-------    ---\   Packages: 2198 (rpm), 27 (flatpak) 
,------------:MMMd--------    ---:   Shell: bash 5.0.7 
:------------:MMMd-------    .---:   Resolution: 2560x1440 
:----    oNMMMMMMMMMNho     .----:   DE: GNOME 3.32.2 
:--     .+shhhMMMmhhy++   .------/   WM: GNOME Shell 
:-    -------:MMMd--------------:    WM Theme: Adwaita 
:-   --------/MMMd-------------;     Theme: Adapta-Nokto-Eta [GTK2/3] 
:-    ------/hMMMy------------:      Icons: Adwaita [GTK2/3] 
:-- :dMNdhhdNMMNo------------;       Terminal: tilix 
:---:sdNMMMMNds:------------:        CPU: Intel i7-6850K (12) @ 4.000GHz 
:------:://:-------------::          GPU: AMD ATI Radeon RX Vega 56/64 
:---------------------://            Memory: 1666MiB / 32045MiB 

> 
> Mesa info can be taken from this command:
> glxinfo | grep "OpenGL version" 

OpenGL version string: 4.5 (Compatibility Profile) Mesa 19.1.5

> 
> Game being played: 

Hearts of Iron IV through Steam for Linux

> Native or Wine or Wine+DXVK:

Native

> 
> Crash type: Game crash? Full System freeze? System freeze but still can drop
> to tty?

Screen goes black suddenly while music continues plays for less than a minute;
music begins to loop; and computer reboots.

> 
> DMESG output after the crash:
> sudo dmesg | grep amdgpu

Here is the pertinent part dmesg with kernel debugging turned on.  Some of the
information the crash would not be captured by grepping amdgpu.  Entire dmesg
provided as an attachment.

[46957.810300] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for
fences timed out or interrupted!
[46962.941366] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
signaled seq=2446766, emitted seq=2446767
[46962.941453] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process hoi4 pid 24014 thread hoi4:cs0 pid 24015
[46962.941459] amdgpu 0000:06:00.0: GPU reset begin!

[46962.942698] ======================================================
[46962.942700] WARNING: possible circular locking dependency detected
[46962.942702] 5.2.11-200.fc30.x86_64+debug #1 Not tainted
[46962.942704] ------------------------------------------------------
[46962.942705] kworker/3:0/20416 is trying to acquire lock:
[46962.942708] 00000000a4a3593f (&(&ring->fence_drv.lock)->rlock){-.-.}, at:
dma_fence_remove_callback+0x1a/0x60
[46962.942717] 
               but task is already holding lock:
[46962.942718] 00000000d45cbf2b (&(&sched->job_list_lock)->rlock){-.-.}, at:
drm_sched_stop+0x34/0x130 [gpu_sched]
[46962.942724] 
               which lock already depends on the new lock.

[46962.942725] 
               the existing dependency chain (in reverse order) is:
[46962.942727] 
               -> #1 (&(&sched->job_list_lock)->rlock){-.-.}:
[46962.942735]        _raw_spin_lock_irqsave+0x49/0x83
[46962.942738]        drm_sched_process_job+0x4d/0x180 [gpu_sched]
[46962.942741]        dma_fence_signal+0x111/0x1a0
[46962.942794]        amdgpu_fence_process+0xa3/0x100 [amdgpu]
[46962.942858]        sdma_v4_0_process_trap_irq+0x8d/0xa0 [amdgpu]
[46962.942918]        amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
[46962.942978]        amdgpu_ih_process+0x8d/0x110 [amdgpu]
[46962.943038]        amdgpu_irq_handler+0x1b/0x50 [amdgpu]
[46962.943043]        __handle_irq_event_percpu+0x3f/0x290
[46962.943046]        handle_irq_event_percpu+0x31/0x80
[46962.943048]        handle_irq_event+0x34/0x51
[46962.943053]        handle_edge_irq+0x83/0x1a0
[46962.943057]        handle_irq+0x1c/0x30
[46962.943059]        do_IRQ+0x61/0x120
[46962.943063]        ret_from_intr+0x0/0x22
[46962.943067]        cpuidle_enter_state+0xc9/0x450
[46962.943069]        cpuidle_enter+0x29/0x40
[46962.943074]        do_idle+0x1ec/0x280
[46962.943076]        cpu_startup_entry+0x19/0x20
[46962.943079]        start_secondary+0x189/0x1e0
[46962.943083]        secondary_startup_64+0xa4/0xb0
[46962.943087] 
               -> #0 (&(&ring->fence_drv.lock)->rlock){-.-.}:
[46962.943095]        lock_acquire+0xa2/0x1b0
[46962.943105]        _raw_spin_lock_irqsave+0x49/0x83
[46962.943109]        dma_fence_remove_callback+0x1a/0x60
[46962.943114]        drm_sched_stop+0x59/0x130 [gpu_sched]
[46962.943225]        amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
[46962.943338]        amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[46962.943413]        amdgpu_job_timedout+0x109/0x130 [amdgpu]
[46962.943418]        drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[46962.943421]        process_one_work+0x272/0x5e0
[46962.943423]        worker_thread+0x50/0x3b0
[46962.943427]        kthread+0x108/0x140
[46962.943431]        ret_from_fork+0x3a/0x50
[46962.943432] 
               other info that might help us debug this:

[46962.943435]  Possible unsafe locking scenario:

[46962.943437]        CPU0                    CPU1
[46962.943438]        ----                    ----
[46962.943439]   lock(&(&sched->job_list_lock)->rlock);
[46962.943441]                               
lock(&(&ring->fence_drv.lock)->rlock);
[46962.943443]                               
lock(&(&sched->job_list_lock)->rlock);
[46962.943445]   lock(&(&ring->fence_drv.lock)->rlock);
[46962.943447] 
                *** DEADLOCK ***

[46962.943449] 5 locks held by kworker/3:0/20416:
[46962.943450]  #0: 0000000043c92b99 ((wq_completion)events){+.+.}, at:
process_one_work+0x1e9/0x5e0
[46962.943456]  #1: 000000000c360f0c
((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at:
process_one_work+0x1e9/0x5e0
[46962.943459]  #2: 000000007a135814 (&adev->lock_reset){+.+.}, at:
amdgpu_device_lock_adev+0x17/0x39 [amdgpu]
[46962.943543]  #3: 00000000e83f7d6b (&dqm->lock_hidden){+.+.}, at:
kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
[46962.943614]  #4: 00000000d45cbf2b (&(&sched->job_list_lock)->rlock){-.-.},
at: drm_sched_stop+0x34/0x130 [gpu_sched]
[46962.943620] 
               stack backtrace:
[46962.943629] CPU: 3 PID: 20416 Comm: kworker/3:0 Not tainted
5.2.11-200.fc30.x86_64+debug #1
[46962.943631] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99
Taichi, BIOS P1.80 04/06/2018
[46962.943636] Workqueue: events drm_sched_job_timedout [gpu_sched]
[46962.943638] Call Trace:
[46962.943648]  dump_stack+0x85/0xc0
[46962.943654]  print_circular_bug.cold+0x15c/0x195
[46962.943658]  __lock_acquire+0x167c/0x1c90
[46962.943664]  lock_acquire+0xa2/0x1b0
[46962.943668]  ? dma_fence_remove_callback+0x1a/0x60
[46962.943674]  _raw_spin_lock_irqsave+0x49/0x83
[46962.943677]  ? dma_fence_remove_callback+0x1a/0x60
[46962.943680]  dma_fence_remove_callback+0x1a/0x60
[46962.943684]  drm_sched_stop+0x59/0x130 [gpu_sched]
[46962.943764]  amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
[46962.943847]  amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[46962.943923]  amdgpu_job_timedout+0x109/0x130 [amdgpu]
[46962.943930]  drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[46962.943934]  process_one_work+0x272/0x5e0
[46962.943938]  worker_thread+0x50/0x3b0
[46962.943942]  kthread+0x108/0x140
[46962.943945]  ? process_one_work+0x5e0/0x5e0
[46962.943948]  ? kthread_park+0x80/0x80
[46962.943952]  ret_from_fork+0x3a/0x50
[46962.961034] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[46962.961044] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[46962.961048] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[46962.961051] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[46962.961149] pcieport 0000:00:03.0: AER: Device recovery failed
[46963.955209] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring page1 timeout,
signaled seq=95391072, emitted seq=95391072
[46963.955328] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information:
process  pid 0 thread  pid 0
[46963.955336] amdgpu 0000:06:00.0: GPU reset begin!
[46968.050083] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]]
*ERROR* [CRTC:47:crtc-0] flip_done timed out
[46973.170223] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0]
hw_done or flip_done timed out
[46983.410080] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]]
*ERROR* [CRTC:47:crtc-0] flip_done timed out
[46993.650098] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]]
*ERROR* [PLANE:45:plane-5] flip_done timed out
[46993.962192] amdgpu: [powerplay] No response from smu
[46993.962195] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0,
error code: 0x0
[46994.277773] amdgpu: [powerplay] No response from smu
[46994.593416] amdgpu: [powerplay] No response from smu
[46994.593420] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1,
error code: 0x0
[46994.908354] amdgpu: [powerplay] No response from smu
[46995.223718] amdgpu: [powerplay] No response from smu
[46995.223722] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0,
error code: 0x0
[46995.286504] [drm] REG_WAIT timeout 10us * 3500 tries - dce_mi_free_dmif
line:634
[46995.286506] ------------[ cut here ]------------
[46995.286605] WARNING: CPU: 3 PID: 20416 at
drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:329
generic_reg_wait.cold+0x31/0x53 [amdgpu]
[46995.286606] Modules linked in: vhost_net vhost tap rfcomm xt_CHECKSUM
xt_MASQUERADE tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast
xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4
xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw
iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set
nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
ip_tables bnep nct6775 hwmon_vid intel_rapl vfat fat arc4 x86_pkg_temp_thermal
intel_powerclamp coretemp fuse kvm_intel kvm iwlmvm irqbypass iTCO_wdt
iTCO_vendor_support mac80211 crct10dif_pclmul crc32_pclmul
snd_hda_codec_realtek ghash_clmulni_intel intel_cstate snd_hda_codec_generic
iwlwifi snd_hda_codec_hdmi ledtrig_audio intel_uncore snd_hda_intel
intel_rapl_perf cfg80211 snd_hda_codec btusb mxm_wmi snd_hda_core btrtl btbcm
snd_hwdep btintel snd_seq i2c_i801 lpc_ich bluetooth
[46995.286626]  snd_seq_device joydev snd_pcm ecdh_generic snd_timer rfkill ecc
mei_me snd mei soundcore pcc_cpufreq binfmt_misc auth_rpcgss sunrpc amdgpu
amd_iommu_v2 gpu_sched ttm drm_kms_helper crc32c_intel igb uas drm usb_storage
dca mpt3sas i2c_algo_bit e1000e nvme raid_class nvme_core scsi_transport_sas
wmi
[46995.286638] CPU: 3 PID: 20416 Comm: kworker/3:0 Not tainted
5.2.11-200.fc30.x86_64+debug #1
[46995.286639] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99
Taichi, BIOS P1.80 04/06/2018
[46995.286643] Workqueue: events drm_sched_job_timedout [gpu_sched]
[46995.286682] RIP: 0010:generic_reg_wait.cold+0x31/0x53 [amdgpu]
[46995.286684] Code: 4c 24 18 44 89 fa 89 ee 48 c7 c7 78 93 80 c0 e8 45 fd a0
ca 83 7b 20 01 0f 84 27 11 fe ff 48 c7 c7 70 92 80 c0 e8 2f fd a0 ca <0f> 0b e9
14 11 fe ff 48 c7 c7 70 92 80 c0 89 54 24 04 e8 18 fd a0
[46995.286685] RSP: 0018:ffff9cd009b3f728 EFLAGS: 00010246
[46995.286687] RAX: 0000000000000024 RBX: ffff8ada6be8a780 RCX:
0000000000000006
[46995.286688] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
ffff8ada7ebd9c80
[46995.286689] RBP: 000000000000000a R08: 0000000000000001 R09:
0000000000000000
[46995.286690] R10: 0000000000000000 R11: 0000000000000000 R12:
00000000000035af
[46995.286691] R13: 0000000000000dad R14: 0000000000000001 R15:
0000000000000dac
[46995.286692] FS:  0000000000000000(0000) GS:ffff8ada7ea00000(0000)
knlGS:0000000000000000
[46995.286694] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46995.286695] CR2: 0000085777c78000 CR3: 00000003cb612005 CR4:
00000000003606e0
[46995.286696] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[46995.286697] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[46995.286698] Call Trace:
[46995.286741]  dce_mi_free_dmif+0xef/0x150 [amdgpu]
[46995.286780]  dce110_reset_hw_ctx_wrap+0x14a/0x1e0 [amdgpu]
[46995.286819]  dce110_apply_ctx_to_hw+0x4a/0x490 [amdgpu]
[46995.286843]  ? amdgpu_pm_compute_clocks.part.0+0xcb/0x610 [amdgpu]
[46995.286882]  ? dm_pp_apply_display_requirements+0x19e/0x1c0 [amdgpu]
[46995.286920]  dc_commit_state+0x262/0x580 [amdgpu]
[46995.286925]  ? vsnprintf+0x3aa/0x4f0
[46995.286965]  amdgpu_dm_atomic_commit_tail+0xc34/0x1970 [amdgpu]
[46995.286971]  ? console_unlock+0x363/0x5d0
[46995.286976]  ? __irq_work_queue_local+0x50/0x60
[46995.286977]  ? irq_work_queue+0x4d/0x60
[46995.286979]  ? wake_up_klogd+0x37/0x40
[46995.286984]  ? wait_for_completion_timeout+0x4c/0x190
[46995.286987]  ? _raw_spin_unlock_irq+0x29/0x40
[46995.286989]  ? wait_for_completion_timeout+0x75/0x190
[46995.287016]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[46995.287021]  commit_tail+0x3c/0x70 [drm_kms_helper]
[46995.287026]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[46995.287031]  drm_atomic_helper_disable_all+0x14c/0x160 [drm_kms_helper]
[46995.287035]  drm_atomic_helper_suspend+0x66/0x100 [drm_kms_helper]
[46995.287076]  dm_suspend+0x20/0x60 [amdgpu]
[46995.287098]  amdgpu_device_ip_suspend_phase1+0x91/0xc0 [amdgpu]
[46995.287123]  amdgpu_device_ip_suspend+0x1c/0x60 [amdgpu]
[46995.287164]  amdgpu_device_pre_asic_reset+0x1f7/0x20c [amdgpu]
[46995.287204]  amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
[46995.287242]  amdgpu_job_timedout+0x109/0x130 [amdgpu]
[46995.287246]  drm_sched_job_timedout+0x40/0x70 [gpu_sched]
[46995.287249]  process_one_work+0x272/0x5e0
[46995.287252]  worker_thread+0x50/0x3b0
[46995.287256]  kthread+0x108/0x140
[46995.287258]  ? process_one_work+0x5e0/0x5e0
[46995.287260]  ? kthread_park+0x80/0x80
[46995.287263]  ret_from_fork+0x3a/0x50
[46995.287267] irq event stamp: 6288284
[46995.287269] hardirqs last  enabled at (6288283): [<ffffffff8bb04d8b>]
_raw_spin_unlock_irqrestore+0x4b/0x60
[46995.287271] hardirqs last disabled at (6288284): [<ffffffff8bb05533>]
_raw_spin_lock_irqsave+0x23/0x83
[46995.287273] softirqs last  enabled at (6288276): [<ffffffff8be0035d>]
__do_softirq+0x35d/0x468
[46995.287276] softirqs last disabled at (6288269): [<ffffffff8b0f07a2>]
irq_exit+0x102/0x110
[46995.287277] ---[ end trace 6a2158c4cfef5172 ]---
[46995.603082] amdgpu: [powerplay] No response from smu
[46995.918767] amdgpu: [powerplay] No response from smu
[46995.918770] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1,
error code: 0x0
[46996.233769] amdgpu: [powerplay] No response from smu
[46996.549255] amdgpu: [powerplay] No response from smu
[46996.549258] amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3,
error code: 0x0
[46996.865320] amdgpu: [powerplay] No response from smu
[46997.181203] amdgpu: [powerplay] No response from smu
[46997.181206] amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4,
error code: 0x0
[46997.495804] amdgpu: [powerplay] No response from smu
[46997.811227] amdgpu: [powerplay] No response from smu
[46997.811231] amdgpu: [powerplay] Failed message: 0xa, input parameter:
0xa0b000, error code: 0x0
[46998.126794] amdgpu: [powerplay] No response from smu
[46998.442559] amdgpu: [powerplay] No response from smu
[46998.442561] amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0,
error code: 0x0
[46998.756884] amdgpu: [powerplay] No response from smu
[46999.072680] amdgpu: [powerplay] No response from smu
[46999.072684] amdgpu: [powerplay] Failed message: 0x4, input parameter: 0x400,
error code: 0x0
[46999.388310] amdgpu: [powerplay] No response from smu
[46999.704067] amdgpu: [powerplay] No response from smu
[46999.704069] amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1,
error code: 0x0
[47000.019626] amdgpu: [powerplay] No response from smu
[47000.334247] amdgpu: [powerplay] No response from smu
[47000.334251] amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0,
error code: 0x0
[47000.350026] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.350043] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.350052] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.350061] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.350202] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.367437] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.367443] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.367444] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.367446] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.367486] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.384977] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.384982] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.384983] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.384985] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.385055] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.402521] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.402530] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.402532] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.402535] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.402578] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.420068] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.420079] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.420085] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.420090] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.420186] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.437608] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.437617] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.437621] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.437625] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.437726] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.455143] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.455151] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.455154] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.455157] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.455209] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.472688] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.472698] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.472703] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.472708] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.472826] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.490225] pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error
received: 0000:00:03.0
[47000.490232] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.490236] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.490239] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.490289] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.507760] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal)
error received: 0000:00:03.0
[47000.735787] pcieport 0000:00:03.0: AER: PCIe Bus Error: severity=Uncorrected
(Non-Fatal), type=Transaction Layer, (Requester ID)
[47000.735791] pcieport 0000:00:03.0: AER:   device [8086:6f08] error
status/mask=00004000/00000000
[47000.735793] pcieport 0000:00:03.0: AER:    [14] CmpltTO               
(First)
[47000.735824] pcieport 0000:00:03.0: AER: Device recovery failed
[47000.735826] pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal)
error received: 0000:00:03.0

> systemd logs output after the crash (If your system froze and you get logs
> after reboot):

Sep 06 08:36:58 ezra.blanchardmorris.net kernel: Command line:
BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug
root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro
resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:36:58 ezra.blanchardmorris.net kernel: Kernel command line:
BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug
root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro
resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:36:59 ezra.blanchardmorris.net dracut-cmdline[361]: Using kernel
command line parameters:
BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug
root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro
resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: [drm] amdgpu kernel
modesetting enabled.
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0:
remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0:
remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0:
remove_conflicting_pci_framebuffers: bar 5: 0xfb600000 -> 0xfb67ffff
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: fb0: switching to amdgpudrmfb
from EFI VGA
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: vgaarb:
deactivate vga console
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: No more
image in the PCI ROM
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: VRAM:
8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GART:
512M 0x0000000000000000 - 0x000000001FFFFFFF
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: AGP:
267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: [drm] amdgpu: 8176M of VRAM
memory ready
Sep 06 08:37:00 ezra.blanchardmorris.net kernel: [drm] amdgpu: 8176M of GTT
memory ready.
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: fbcon: amdgpudrmfb (fb0) is
primary device
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: fb0:
amdgpudrmfb frame buffer device
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring gfx
uses VM inv eng 0 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.0.0 uses VM inv eng 1 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.1.0 uses VM inv eng 4 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.2.0 uses VM inv eng 5 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.3.0 uses VM inv eng 6 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.0.1 uses VM inv eng 7 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.1.1 uses VM inv eng 8 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.2.1 uses VM inv eng 9 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
comp_1.3.1 uses VM inv eng 10 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
kiq_2.1.0 uses VM inv eng 11 on hub 0
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
sdma0 uses VM inv eng 0 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
page0 uses VM inv eng 1 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
sdma1 uses VM inv eng 4 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
page1 uses VM inv eng 5 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
uvd_0 uses VM inv eng 6 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
uvd_enc_0.0 uses VM inv eng 7 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring
uvd_enc_0.1 uses VM inv eng 8 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce0
uses VM inv eng 9 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce1
uses VM inv eng 10 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: ring vce2
uses VM inv eng 11 on hub 1
Sep 06 08:37:01 ezra.blanchardmorris.net kernel: [drm] Initialized amdgpu
3.32.0 20150101 for 0000:06:00.0 on minor 0
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]:
Kernel command line: BOOT_IMAGE=(hd4,gpt6)/vmlinuz-5.2.11-200.fc30.x86_64+debug
root=UUID=e7b8b34a-e17f-4c2b-b223-eaa636249d2d ro
resume=UUID=52cc8cd8-b06f-4613-8781-a105d0ebf44a rhgb quiet amdgpu.vm_debug=1
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]:     
   loading driver: amdgpu
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (==)
Matched amdgpu as autoconfigured driver 0
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (II)
LoadModule: "amdgpu"
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (II)
Loading /usr/lib64/xorg/modules/drivers/amdgpu_drv.so
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]: (II)
Module amdgpu: vendor="X.Org Foundation"
Sep 06 08:37:48 ezra.blanchardmorris.net /usr/libexec/gdm-x-session[1969]:     
   All GPUs supported by the amdgpu kernel driver
Sep 06 16:13:18 ezra.blanchardmorris.net net.lutris.Lutris.desktop[2234]:
2019-09-06 16:13:18,530: GPU: 1002:687F 1002:0B36 using amdgpu drivers
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed
out or interrupted!
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout
[amdgpu]] *ERROR* ring gfx timeout, signaled seq=2446766, emitted seq=2446767
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout
[amdgpu]] *ERROR* Process information: process hoi4 pid 24014 thread hoi4:cs0
pid 24015
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GPU reset
begin!
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
amdgpu_fence_process+0xa3/0x100 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
sdma_v4_0_process_trap_irq+0x8d/0xa0 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
amdgpu_irq_dispatch+0xc0/0x250 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
amdgpu_ih_process+0x8d/0x110 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
amdgpu_irq_handler+0x1b/0x50 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:       
amdgpu_job_timedout+0x109/0x130 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:  #2: 000000007a135814
(&adev->lock_reset){+.+.}, at: amdgpu_device_lock_adev+0x17/0x39 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel:  #3: 00000000e83f7d6b
(&dqm->lock_hidden){+.+.}, at: kgd2kfd_pre_reset+0x30/0x60 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: 
amdgpu_device_pre_asic_reset+0x41/0x20c [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: 
amdgpu_device_gpu_recover+0x77/0x788 [amdgpu]
Sep 06 21:39:39 ezra.blanchardmorris.net kernel: 
amdgpu_job_timedout+0x109/0x130 [amdgpu]
Sep 06 21:39:40 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout
[amdgpu]] *ERROR* ring page1 timeout, signaled seq=95391072, emitted
seq=95391072
Sep 06 21:39:40 ezra.blanchardmorris.net kernel: [drm:amdgpu_job_timedout
[amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Sep 06 21:39:40 ezra.blanchardmorris.net kernel: amdgpu 0000:06:00.0: GPU reset
begin!
Sep 06 21:39:49 ezra.blanchardmorris.net kernel: [drm:amdgpu_dm_atomic_check
[amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No
response from smu
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed
message: 0xe, input parameter: 0x0, error code: 0x0
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No
response from smu
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No
response from smu
Sep 06 21:40:10 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] Failed
message: 0x42, input parameter: 0x1, error code: 0x0
Sep 06 21:40:11 ezra.blanchardmorris.net kernel: amdgpu: [powerplay] No
response from smu

I will try to run apitrace on Hearts of Iron IV to try to capture more
information.  Please let me know if I can be of further assistance in squashing
this annoying bug, like providing crash information with the mesa debug
packages installed.

-- 
You are receiving this mail because:
You are the assignee for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20190907/cdd475e1/attachment-0001.html>