[Bug 206475] amdgpu under load drop signal to monitor until hard reset

Mon Mar 22 09:36:45 UTC 2021

https://bugzilla.kernel.org/show_bug.cgi?id=206475

Marco (rodomar705 at protonmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |ANSWERED

--- Comment #20 from Marco (rodomar705 at protonmail.com) ---
I finally found where the problem was and completely fixed it. It was hardware.
The heatsink was not making full contact with a section of the MOSFETs that
feed power to the core of the card. Under full load they were overheating,
tripping their thermal protection, and completely stalling the card to avoid
damaging themselves. The problem was that this card wasn't reporting their
temperatures to software, even if the actual VRM controller may have been
reading them (or it may have been shutting down only when the MOSFETs raised a
signal asserting the thermal runaway condition). This was hell to debug and
fix, as always with hardware problems, but after a stress test on both Windows
and Linux at full clocks, the issue is no longer present.
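
For anyone chasing something similar, here is a rough sketch of how to list
the temperatures the driver does expose, so you can see what is missing. It
assumes the standard Linux hwmon sysfs layout (a "name" file plus
temp*_input / temp*_label files, with values in millidegrees Celsius). The
function name read_hwmon_temps is just something made up for this example.
On amdgpu the labels are typically things like edge, junction and mem; on my
card nothing for the VRM MOSFETs showed up there.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon", driver="amdgpu"):
    """Return {"hwmonN/label": degrees Celsius} for each temp sensor of `driver`.

    Hypothetical helper: it only reads what the kernel exposes through hwmon,
    so a sensor the board does not report (e.g. VRM MOSFETs) will be absent.
    """
    temps = {}
    for hwmon in sorted(Path(root).glob("hwmon*")):
        name_file = hwmon / "name"
        # Skip hwmon devices that belong to other drivers.
        if not name_file.is_file() or name_file.read_text().strip() != driver:
            continue
        for input_file in sorted(hwmon.glob("temp*_input")):
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label"))
            # Fall back to the file name when the sensor has no label.
            label = (label_file.read_text().strip()
                     if label_file.is_file() else input_file.name)
            try:
                # hwmon reports temperatures in millidegrees Celsius.
                celsius = int(input_file.read_text()) / 1000.0
            except (OSError, ValueError):
                # A sensor can fail to read, e.g. while the GPU is suspended.
                continue
            temps[f"{hwmon.name}/{label}"] = celsius
    return temps
```

Calling read_hwmon_temps() with no arguments while a stress test is running
gives a quick view of what software can actually see. If the card stalls while
every listed sensor is still well within limits, that points to a component
that is not being reported at all.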

I'll keep my optimized clocks for lower temperatures and less fan noise, but
for me the issue wasn't software.
