Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

Bridgman, John John.Bridgman at amd.com
Sun Mar 8 19:14:07 UTC 2020


[AMD Public Use]

Fixing the security tag...

________________________________
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Bridgman, John <John.Bridgman at amd.com>
Sent: March 8, 2020 3:10 PM
To: Clemens Eisserer <linuxhippy at gmail.com>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

OK, that's a bit strange... I found mcelog and MCE-Ryzen-Decoder as options for decoding.

In the MCE-Ryzen-Decoder docs the example is exactly the error you are seeing, with the same output, so I'm guessing that is what you are using:

https://github.com/DimitriFourny/MCE-Ryzen-Decoder

On the other hand, I found a report on the AMD forums where the same error is decoded by mcelog as a generic error in a memory transaction, which seems to make more sense.

https://community.amd.com/thread/216084

For something as simple as the GPU bus interface not responding to an access by the CPU, I think you would get a different error (bus error), but I'm not 100% sure about that.

My first thought would be to see if your mobo BIOS has an option to force PCIe gen3 instead of gen4 and see if that makes a difference. There are some amdgpu module parameters related to PCIe as well, but I'm not sure which ones to recommend.
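
One way to check whether a BIOS change to the PCIe generation actually took effect is to read the negotiated link speed from sysfs. The following is a minimal sketch, assuming a kernel that exposes the current_link_speed, max_link_speed, current_link_width and max_link_width attributes under /sys/bus/pci/devices. The function name pcie_link_info and the device address in the usage line are hypothetical; the real address of the GPU can be taken from lspci.

```python
from pathlib import Path

# Link attributes that the kernel exposes per PCI device in sysfs.
LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)

def pcie_link_info(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCIe link attributes of one device as a dict.

    bdf is the domain:bus:device.function address, e.g. "0000:0a:00.0".
    Attributes that are missing (older kernel, non-PCIe device) map to None.
    """
    dev = Path(sysfs_root) / bdf
    info = {}
    for attr in LINK_ATTRS:
        path = dev / attr
        info[attr] = path.read_text().strip() if path.exists() else None
    return info

if __name__ == "__main__":
    # Hypothetical address; replace with the one lspci reports for the GPU.
    print(pcie_link_info("0000:0a:00.0"))
```

Comparing the output before and after the BIOS change shows whether the link really came up at the lower speed. Note that the link may also be downclocked temporarily by power management, so it is best read while the GPU is under load.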

________________________________
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Bridgman, John <John.Bridgman at amd.com>
Sent: March 8, 2020 2:45 PM
To: Clemens Eisserer <linuxhippy at gmail.com>; amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Subject: Re: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?


[AMD Official Use Only - Internal Distribution Only]

The decoded MCE info doesn't look right... if the last bit is a zero I believe that means the watchdog timer is not enabled.

That said, I'm not sure how the decoder you found works, but it seems like a bit more information would be required than what you passed in. Can you point me to the program you used?

Thanks,
John

________________________________
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Clemens Eisserer <linuxhippy at gmail.com>
Sent: March 8, 2020 9:06 AM
To: amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
Subject: Possibility of RX570 responsible for spontaneous reboots (MCE) with Ryzen 3700x?

Hi there,

Right after Ryzen3xxx was available I built a new system consisting of:
- Asrock Phantom Gaming 4 X570 (latest BIOS 2.3)
- Ryzen 3700x (not overclocked)
- MSI RX570 4GB
- Larger CPU cooler, high quality PSU, etc...

The system runs stable with Windows 10 (no reboot or BSOD in months) and
runs memtest86 (single/multicore) as well as various load-tests for
hours without errors. However running Linux I get a spontaneous reboot
every now and then (2-3x a week), with always the same machine check
exception logged:

[    0.105003] .... node  #0, CPUs:        #1  #2
[    0.107022] mce: [Hardware Error]: Machine check events logged
[    0.107023] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
[    0.107092] mce: [Hardware Error]: TSC 0 ADDR 7f80a0c0181a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.107167] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1580717835 SOCKET 0 APIC 4 microcode 8701013

I've tried a lot of different CPU-related things, like disabling C6,
disabling MWAIT use for task switching, etc., without success.
I tried two times to contact AMD support, only asking them to please
decode the MCE hex value - but as soon as they read the term
"linux" they basically abort any communication. And to be honest, I had
the impression that they did not actually know what an MCE is in the
first place.

Luckily I found a decoder on github which prints:
Bank: Execution Unit (EX)
Error: Watchdog Timeout error (WDT 0x0)
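
To see which parts of the logged values a decoder can work from, the status and IPID words from the log above can be split into fields by hand. This is only a sketch: the bit positions are assumptions taken from the general x86 machine-check register layout and from how the Linux kernel treats AMD scalable MCA banks, and should be checked against the AMD Processor Programming Reference for this CPU family. The function names are hypothetical.

```python
# Assumed bit positions in the MCA status word (verify against the AMD PPR).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
    55: "TCC",    # task context corrupt
    53: "SYNDV",  # syndrome register is valid
}

def decode_status(status):
    """Split a 64-bit MCA status value into flag names and error codes."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,            # bits 15:0
        "error_code_ext": (status >> 16) & 0x3F,  # bits 21:16 (assumed width)
    }

def decode_ipid(ipid):
    """Extract the hardware ID and bank type from an MCA IPID value."""
    return {
        "hwid": (ipid >> 32) & 0xFFF,       # bits 43:32 (assumed)
        "mca_type": (ipid >> 48) & 0xFFFF,  # bits 63:48 (assumed)
    }

def decode_memory_error_code(code):
    """Split an error code of the form 0000 0001 RRRR TTLL, else None."""
    if (code >> 8) != 0x01:
        return None
    return {
        "rrrr": (code >> 4) & 0xF,  # transaction type, 0 = generic
        "tt": (code >> 2) & 0x3,    # 0 = instruction, 1 = data, 2 = generic
        "ll": code & 0x3,           # cache level, 0 = level 0
    }

if __name__ == "__main__":
    status = decode_status(0xbea0000000000108)
    print(status)
    print(decode_ipid(0x500b000000000))
    print(decode_memory_error_code(status["error_code"]))
```

Under these assumptions the status word has VAL, UC, EN, MISCV, ADDRV, PCC, TCC and SYNDV set, an extended error code of 0 and an error code of 0x108, while the IPID gives hardware ID 0xb0 and bank type 0x5. The bank name and error string printed by a decoder come from looking these numbers up in a per-family table, which is not reproduced here.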

I was rather hopeless until I found the following reddit thread:
https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/
The users there claim to experience exactly the same problem (even
with the same MCE code logged) but were using R600 based graphics
cards - one of them is even using the same mainboard. When he swapped
his R600 card for a new RX5700, the problems vanished.

I don't have the luxury to simply try another GPU (my RX570 is the
only one properly driving my 4k at 60Hz panel), however the whole
observation makes me wonder: how can a GPU be responsible for
low-level errors such as the machine check exception in the execution
units mentioned above?
Could DMA transfers gone bad be the culprit?
Are there any "safe mode" options available I could try regarding
amdgpu (I tried disabling low-power states but this didn't help and
only made my GPU fans spin up)?

Any help is highly appreciated.

Thanks, Clemens
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx