[Bug 111763] ring_gfx hangs/freezes on Navi gpus

bugzilla-daemon at freedesktop.org
Tue Nov 5 06:07:33 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=111763

--- Comment #23 from wychuchol <wychuchol7777 at gmail.com> ---
(In reply to wychuchol from comment #19)
> After some time in Witcher 3 GOTY run with Lutris PC restarts on it's own. I
> thought something is overheating (I've noticed graphic card memory in
> PSensor sometimes reaching 90 so I thought maybe that's what's happening)
> but I investigated kern.log and this always happened before that autonomous
> reset:
> 
> Nov  2 22:01:53 pop-os kernel: [  979.244964] pcieport 0000:00:01.1: AER:
> Corrected error received: 0000:01:00.0
> Nov  2 22:01:53 pop-os kernel: [  979.244967] nvme 0000:01:00.0: AER: PCIe
> Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:  
> device [1987:5012] error status/mask=00001000/00006000
> Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:   
> [12] Timeout               
> Nov  2 22:01:53 pop-os kernel: [  979.262629] Emergency Sync complete

The thing with those AER errors is that they can go on and on, and the reset
happens a few minutes after the last logged error.
This might be overheating: I found out how to write the sensors readings to a
text log, and the memory temperature went up to 96 °C (or rather, it stayed
there for about 1 min 10 s).
Last reading before the reset:
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx:       +1.16 V  
fan1:        1551 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +74.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +88.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +96.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:      162.00 W  (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +70.5°C  (high = +70.0°C)
Tctl:         +70.5°C  

Now the weird thing is: if this is in fact overheating, why didn't the fan go
beyond 1600 RPM even once? The highest reading was about 1581 RPM, and I don't
have the silent BIOS switched on (Sapphire Pulse RX 5700 XT, lever facing away
from the video ports).
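
For anyone who wants to scan a sensors log for readings close to their
critical limit, here is a minimal Python sketch. It assumes the log holds
plain-text `sensors` output in the same layout as the reading above. The
names `parse_temps` and `near_critical` and the 5 °C margin are made up for
illustration and are not part of lm-sensors or amdgpu.

```python
import re

# Matches temperature lines such as
#   "mem:          +96.0°C  (crit = +99.0°C, hyst = -273.1°C)"
# The "(crit = ...)" part is optional; voltage, fan and power lines do not match.
TEMP_LINE = re.compile(
    r"^(?P<label>\w+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C"
    r"(?:\s+\(crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C)?"
)

# Excerpt of the last reading before the reset, copied from the log above.
SAMPLE = """\
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx:       +1.16 V
fan1:        1551 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +74.0°C  (crit = +118.0°C, hyst = -273.1°C)
junction:     +88.0°C  (crit = +99.0°C, hyst = -273.1°C)
mem:          +96.0°C  (crit = +99.0°C, hyst = -273.1°C)
power1:      162.00 W  (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +70.5°C  (high = +70.0°C)
Tctl:         +70.5°C
"""


def parse_temps(text):
    """Return {label: (temp_c, crit_c or None)} for every temperature line."""
    readings = {}
    for line in text.splitlines():
        m = TEMP_LINE.match(line)
        if m:
            crit = m.group("crit")
            readings[m.group("label")] = (
                float(m.group("temp")),
                float(crit) if crit is not None else None,
            )
    return readings


def near_critical(readings, margin=5.0):
    """Return sorted labels whose temperature is within `margin` °C of crit."""
    return sorted(
        label
        for label, (temp, crit) in readings.items()
        if crit is not None and crit - temp <= margin
    )


if __name__ == "__main__":
    print(near_critical(parse_temps(SAMPLE)))
```

Sensors without a reported crit value (Tdie and Tctl here) are parsed but
never flagged, since there is no limit to compare them against.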
