<html> <head> <base href="https://bugs.freedesktop.org/"> </head> <body> <div> <a class="bz_bug_link bz_status_NEW " title="NEW - ring_gfx hangs/freezes on Navi gpus" href="https://bugs.freedesktop.org/show_bug.cgi?id=111763#c24">Comment # 24</a> on <a class="bz_bug_link bz_status_NEW " title="NEW - ring_gfx hangs/freezes on Navi gpus" href="https://bugs.freedesktop.org/show_bug.cgi?id=111763">bug 111763</a> from <a class="email" href="mailto:wychuchol7777@gmail.com" title="wychuchol <wychuchol7777@gmail.com>"> wychuchol</a> <pre>(In reply to wychuchol from <a href="show_bug.cgi?id=111763#c23">comment #23</a>)
> (In reply to wychuchol from <a href="show_bug.cgi?id=111763#c19">comment #19</a>)
> > After some time in Witcher 3 GOTY run with Lutris PC restarts on its own. I
> > thought something is overheating (I've noticed graphic card memory in
> > PSensor sometimes reaching 90 so I thought maybe that's what's happening)
> > but I investigated kern.log and this always happened before that autonomous
> > reset:
> > 
> > Nov 2 22:01:53 pop-os kernel: [ 979.244964] pcieport 0000:00:01.1: AER:
> > Corrected error received: 0000:01:00.0
> > Nov 2 22:01:53 pop-os kernel: [ 979.244967] nvme 0000:01:00.0: AER: PCIe
> > Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> > Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER:
> > device [1987:5012] error status/mask=00001000/00006000
> > Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER:
> > [12] Timeout
> > Nov 2 22:01:53 pop-os kernel: [ 979.262629] Emergency Sync complete
> 
> Thing with those AER errors is that they can go on and on and reset happens
> few minutes after the last logged error.
> This might be overheating, I managed to find how to output sensors readings
> into txt log and found that memory went up to 96 C (or rather it stayed
> there for about 1m 10s)
> Last reading before reset:
> amdgpu-pci-2800
> Adapter: PCI adapter
> vddgfx:       +1.16 V
> fan1:        1551 RPM  (min =    0 RPM, max = 3200 RPM)
> edge:         +74.0°C  (crit = +118.0°C, hyst = -273.1°C)
>                        (emerg = +99.0°C)
> junction:     +88.0°C  (crit = +99.0°C, hyst = -273.1°C)
>                        (emerg = +99.0°C)
> mem:          +96.0°C  (crit = +99.0°C, hyst = -273.1°C)
>                        (emerg = +99.0°C)
> power1:      162.00 W  (cap = 195.00 W)
> 
> k10temp-pci-00c3
> Adapter: PCI adapter
> Tdie:         +70.5°C  (high = +70.0°C)
> Tctl:         +70.5°C
> 
> Now the weird thing is - if this is in fact overheating why fan didn't go
> beyond 1600 rpm even once.... Highest was like 1581 rpm and I don't have
> silent bios switched on (sapphire pulse rx 5700 xt, lever facing away from
> video ports).

Okay, I don't think it's overheating anymore. I found a moment in Anomaly 1.5.0
that I can't get past without the system resetting, just before a psi storm in
Army Warehouses (I can provide a savefile).
Last sensors reading before the crash (logged in 5-second increments):
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx:       +1.01 V
fan1:        1560 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +69.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +84.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +80.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:      227.00 W  (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +71.8°C  (high = +70.0°C)
Tctl:         +71.8°C</pre> </div> <hr> You are receiving this mail because: <ul> <li>You are the assignee for the bug.</li> </ul> </body> </html>