Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

someguy108 someguy108 at gmail.com
Thu Apr 2 11:11:46 UTC 2020


Hello! I saw Clemens Eisserer's email regarding MCE errors with his RX 570
and 3700X, and I would like to add my system to the list of spontaneous MCE
reboots as well.
This is my configuration:
-Ryzen 3900x + Noctua D15
-MSI X570 Unify (latest AGESA as of writing)
-DDR4 3200 MHz 32GB kit
-Sapphire Pulse 5700 XT
-Corsair RMX 850 Watt
-Arch Linux with kernel 5.5.13
-Mesa 20.0.3
-Early KMS enabled

I've had this system up and running since November 2019, initially with an
Nvidia 1060 and Windows 10, and everything ran smoothly. About a month ago
I bought my 5700 XT and switched back to Linux, as I had always planned to.
Since returning I've experienced multiple spontaneous MCE reboots. All of
them happened while I was playing one particular game, Warcraft 3 Reforged.
The MCE event is the following:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffad66d6fe MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 2 microcode 8701013
kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc1196eb6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 9 microcode 8701013
kernel: #16 #17 #18 #19 #20 #21 #22 #23

Initially I figured it could be the RAM, so I ran the usual tests, which
found no problems. I also tested with standard JEDEC settings and
eventually received an MCE during Warcraft 3 Reforged. After consulting
with a few friends I tried a different power supply, to no avail. I then
bit the bullet and bought a brand new 3900X. I cleared CMOS both before
and after installing the new 3900X. All CPU values are on auto with no PBO
or manual overclocking; the only non-default setting is the RAM.
Yesterday, after owning the new 3900X for three days, I had an MCE while
playing Warcraft 3 Reforged. I have tested other games (World of Warcraft,
The Outer Worlds, Stellaris, and Counter-Strike: Global Offensive), but
none of them caused an MCE or any crashes or freezes for that matter.

As with Clemens, the decoder he used, MCE-Ryzen-Decoder from GitHub,
reports the MCE as follows:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)
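
For anyone who wants to sanity-check the raw values by hand, here is a
minimal sketch that splits the status and IPID words from the log above
into fields. The bit positions are my assumption, taken from how the Linux
kernel names the AMD MCA status bits, and the function names are made up
for illustration; this is not how MCE-Ryzen-Decoder itself works.

```python
from datetime import datetime, timezone

# Assumed bit layout of the AMD MCA status word (names as used in the
# Linux kernel's MCE headers); verify against the PPR for your CPU.
STATUS_FLAGS = {
    63: "VAL",       # record is valid
    62: "OVER",      # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "EN",        # error reporting enabled
    59: "MISCV",     # MISC register valid
    58: "ADDRV",     # ADDR register valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SYNDV",     # SYND register valid
    44: "DEFERRED",
    43: "POISON",
}


def decode_status(status):
    """Split a 64-bit status value into flag names and error codes."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # assumed 6-bit field
        "error_code": status & 0xFFFF,
    }


def decode_ipid(ipid):
    """Extract the hardware ID and MCA type from the IPID value."""
    return {"hwid": (ipid >> 32) & 0xFFF, "mca_type": (ipid >> 48) & 0xFFFF}


def decode_time(epoch_seconds):
    """Convert the TIME field (Unix seconds) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(decode_status(0xBEA0000000000108))
    print(decode_ipid(0x500B000000000))
    print(decode_time(1585120217))
```

With the values from my log this gives an extended error code of 0x0, which
lines up with the "WDT 0x0" above, and the UC and PCC flags are set, which
would fit an unrecoverable error followed by a reboot.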

One thing to note is that I haven't received it during desktop usage, only
in Warcraft 3. Desktop compositing is disabled in both Xfce and KDE, and
always has been; I have used both desktops and received the MCEs in
sessions under each. I have noticed a pattern with the MCE crashes in
Warcraft 3: they always happen during a transition where GPU load drops
off or ramps up, for example when exiting a match to return to the lobby,
or when a map finishes loading and the game switches from the loading
screen to the match itself. The entire screen quickly turns black,
everything hard locks, and after about a minute the machine reboots on
its own. It hasn't happened yet in the middle of a match, while sitting
in the lobby, or at the main menu screen; it has consistently been during
a transition. My theory is that this could be a GPU hang while switching
from one power state to another. The GPU hang causes the CPU to stall,
and thus an MCE. A GPU hang would also explain the sudden solid black
screen, as all output stops. But I'm really just assuming here, based on
my own observations and my limited understanding. A possible reason why
this triggers in Warcraft 3 is that the other games have few moments of
heavy power state switching: The Outer Worlds, World of Warcraft,
Stellaris, and Counter-Strike: Global Offensive all keep a constant high
load on the GPU, and their match sessions are long.
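
One way to check the power state theory would be to log the GPU clock level
while playing and see whether a switch lines up with the hang. Below is a
rough sketch, assuming the amdgpu sysfs file pp_dpm_sclk is available and
lists levels as "N: <freq>Mhz" with a "*" on the active one. The card0 path
is an assumption (the card index may differ), and the function names are
hypothetical.

```python
import re
import time

# Assumed location; the card index can differ between systems.
SCLK_PATH = "/sys/class/drm/card0/device/pp_dpm_sclk"

LEVEL_RE = re.compile(r"\s*(\d+):\s*(\d+)\s*Mhz\s*(\*)?", re.IGNORECASE)


def current_dpm_level(text):
    """Return (level index, MHz) of the line marked with '*', or None."""
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match and match.group(3):
            return int(match.group(1)), int(match.group(2))
    return None


def log_transitions(path=SCLK_PATH, interval=0.05, samples=200):
    """Poll the sysfs file and print a line whenever the active level changes."""
    last = None
    for _ in range(samples):
        with open(path) as f:
            level = current_dpm_level(f.read())
        if level != last:
            print(f"{time.time():.3f} sclk level -> {level}")
            last = level
        time.sleep(interval)
```

Running log_transitions() in a terminal with the output redirected to a
file on disk would leave a record of the last level change before a hard
lock, although the final lines may be lost if they were not flushed in time.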

For what it's worth, I've had no major issues in Windows 10. The only
quirks were a few TDRs early on, which recovered, when alt-tabbing out of
most games with Google Chrome running. Disabling hardware acceleration in
Chrome fixed those TDRs.

From searching, which is how I found this mailing list report, I've found
quite a few reports of people receiving MCEs that aren't the typical
first-generation Ryzen MCE reports from 2017. Those were fixed by
disabling C-states, RAM changes, and changing the power supply current
setting from low to typical. The ones from the past year all appear to
have an AMD GPU in common. I did notice a few with Intel CPUs as well,
paired with an AMD GPU.

Any feedback would be greatly appreciated.


More information about the amd-gfx mailing list