<html>
<head>
<base href="https://bugs.freedesktop.org/">
</head>
<body>
<p>
<div>
<b><a class="bz_bug_link
bz_status_NEW "
title="NEW - ring_gfx hangs/freezes on Navi gpus"
href="https://bugs.freedesktop.org/show_bug.cgi?id=111763#c24">Comment # 24</a>
on <a class="bz_bug_link
bz_status_NEW "
title="NEW - ring_gfx hangs/freezes on Navi gpus"
href="https://bugs.freedesktop.org/show_bug.cgi?id=111763">bug 111763</a>
from <span class="vcard"><a class="email" href="mailto:wychuchol7777@gmail.com" title="wychuchol <wychuchol7777@gmail.com>"> <span class="fn">wychuchol</span></a>
</span></b>
<pre>(In reply to wychuchol from <a href="show_bug.cgi?id=111763#c23">comment #23</a>)
<span class="quote">> (In reply to wychuchol from <a href="show_bug.cgi?id=111763#c19">comment #19</a>)
> > After some time in Witcher 3 GOTY, run through Lutris, the PC restarts on its
> > own. I thought something was overheating (I've noticed the graphics card
> > memory in PSensor sometimes reaching 90, so I thought maybe that was the
> > cause), but I checked kern.log and this always appeared before the
> > spontaneous reset:
> >
> > Nov 2 22:01:53 pop-os kernel: [ 979.244964] pcieport 0000:00:01.1: AER:
> > Corrected error received: 0000:01:00.0
> > Nov 2 22:01:53 pop-os kernel: [ 979.244967] nvme 0000:01:00.0: AER: PCIe
> > Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> > Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER:
> > device [1987:5012] error status/mask=00001000/00006000
> > Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER:
> > [12] Timeout
> > Nov 2 22:01:53 pop-os kernel: [ 979.262629] Emergency Sync complete
>
> The thing with those AER errors is that they can go on and on, and the reset
> happens a few minutes after the last logged error.
> This might be overheating: I found out how to write sensors readings to a
> text log and saw that memory went up to 96 C (or rather, it stayed there
> for about 1 min 10 s).
> Last reading before reset:
> amdgpu-pci-2800
> Adapter: PCI adapter
> vddgfx: +1.16 V
> fan1: 1551 RPM (min = 0 RPM, max = 3200 RPM)
> edge: +74.0°C (crit = +118.0°C, hyst = -273.1°C)
> (emerg = +99.0°C)
> junction: +88.0°C (crit = +99.0°C, hyst = -273.1°C)
> (emerg = +99.0°C)
> mem: +96.0°C (crit = +99.0°C, hyst = -273.1°C)
> (emerg = +99.0°C)
> power1: 162.00 W (cap = 195.00 W)
>
> k10temp-pci-00c3
> Adapter: PCI adapter
> Tdie: +70.5°C (high = +70.0°C)
> Tctl: +70.5°C
>
> Now the weird thing is: if this is in fact overheating, why didn't the fan go
> beyond 1600 RPM even once? The highest was about 1581 RPM, and I don't have
> the silent BIOS switched on (Sapphire Pulse RX 5700 XT, lever facing away
> from the video ports).</span >
Okay, I don't think it's overheating anymore. I found a spot in Anomaly 1.5.0
that I can't get past without the system resetting, just before a psi storm in
Army Warehouses (I can provide a save file).
Last sensors reading before the crash (logged at 5-second intervals):
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx: +1.01 V
fan1: 1560 RPM (min = 0 RPM, max = 3200 RPM)
edge: +69.0°C (crit = +118.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
junction: +84.0°C (crit = +99.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
mem: +80.0°C (crit = +99.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 227.00 W (cap = 195.00 W)
k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +71.8°C (high = +70.0°C)
Tctl: +71.8°C</pre>
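<pre>For anyone comparing these dumps, here is a rough sketch of how readings like
the ones above could be checked against the limits printed next to them. The
function name over_limit, the margin parameter and the regex are hypothetical
and only assume the text layout of the sensors output shown in this comment;
nothing here comes from lm-sensors itself.

```python
import re

# Matches lines such as "power1: 227.00 W (cap = 195.00 W)" or
# "mem: +96.0°C (crit = +99.0°C, hyst = -273.1°C)".
# Assumption: a reading and its first limit (cap, crit or high) share one line,
# as in the dumps above. Lines without such a limit are ignored.
READING = re.compile(
    r"^\s*(\w+):\s*\+?([\d.]+)\s*(?:°C|W|V|RPM)?\s*\((cap|crit|high)\s*=\s*\+?([\d.]+)"
)


def over_limit(sensors_text, margin=0.0):
    """Return (label, value, limit_name, limit) for readings above limit - margin."""
    hits = []
    for line in sensors_text.splitlines():
        m = READING.match(line)
        if m is None:
            continue
        label, limit_name = m.group(1), m.group(3)
        value, limit = float(m.group(2)), float(m.group(4))
        if value > limit - margin:
            hits.append((label, value, limit_name, limit))
    return hits
```

Run on the last dump above with the default margin, this would flag power1
(227 W against a 195 W cap) and Tdie (71.8 against high = 70.0), while edge,
junction and mem all stay below their crit values.</pre>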
</div>
</p>
</body>
</html>