[Bug 111989] Diagnosing issues with Radeon VII

Mon Oct 14 07:27:35 UTC 2019

https://bugs.freedesktop.org/show_bug.cgi?id=111989

            Bug ID: 111989
           Summary: Diagnosing issues with Radeon VII
           Product: DRI
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: major
          Priority: not set
         Component: DRM/AMDgpu
          Assignee: dri-devel at lists.freedesktop.org
          Reporter: ragnaros39216 at yandex.com

I'm in the process of diagnosing issues with a Radeon VII that I might have
damaged while trying to improve its thermal conditions. Before all this, the
card had no major issues other than running too hot while mining (around 80-90
Celsius even with the fan maxed out via Radeon Profile). That temperature and
the fan noise were beyond acceptable and were the main reason I wanted to
improve the thermal conditions in the first place.

The GPU in question now automatically switches to some kind of "safe clock" of
700/350 (as observed in Radeon Profile) when under heavy load such as mining
(using the ROCm backend on Manjaro/Arch), and cannot return to its normal
clocks on its own. I can force the default clocks back using Radeon Profile,
but if the card is still under load, the screen immediately becomes garbled
and a few seconds later the system hard resets. The GPU is then not detected
in subsequent boots (the display gets routed to the BMC on the motherboard
instead of the video card) until I do a power cycle (manually or via IPMI).
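
The clock levels can also be read directly from sysfs, which may make it
easier to log when the drop to the "safe clock" happens. The following is a
minimal Python sketch, assuming the amdgpu driver exposes the core clock levels
in /sys/class/drm/card0/device/pp_dpm_sclk with the active level marked by an
asterisk. The card index and the helper names (parse_dpm_levels,
active_clock_mhz) are assumptions for illustration and are not taken from this
report.

```python
import re
from pathlib import Path

# Assumed sysfs location; the card index (card0) differs between systems.
SCLK_PATH = Path("/sys/class/drm/card0/device/pp_dpm_sclk")

# Matches lines such as "1: 1150Mhz *" (level, frequency, optional marker).
LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return a list of (level, mhz, active) tuples from pp_dpm_* text."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line)
        if match:
            level = int(match.group(1))
            mhz = int(match.group(2))
            active = match.group(3) is not None
            levels.append((level, mhz, active))
    return levels


def active_clock_mhz(text):
    """Return the frequency of the level marked active, or None."""
    for _level, mhz, active in parse_dpm_levels(text):
        if active:
            return mhz
    return None


if __name__ == "__main__":
    if SCLK_PATH.exists():
        print(active_clock_mhz(SCLK_PATH.read_text()))
    else:
        print(f"{SCLK_PATH} not found")
```

Polling this value once per second while mining would give a timestamped
record of when the clock drops, independent of Radeon Profile.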

After some failed attempts to mod the stock cooler to improve the thermal
conditions (during which the symptoms began), I eventually replaced the cooler
altogether with an Alphacool Eiswolf for this card. Although the thermal
conditions have improved greatly (the card can still run Unigine Heaven tests
at full clock for a short while without issues and at an acceptable 60
Celsius), the issue with entering "safe clock" while mining has not gone away.

After some testing I was able to get a usable under-load GPU clock of 1150MHz
with Radeon Profile (the card runs at around 40 Celsius under load), but the
condition has only gotten worse: now I can only maintain a stable clock at
around 1000MHz without entering "safe clock" too quickly. The "safe clock" can
still kick in when I'm doing something else while mining, but as long as the
clocks are set below safe ranges, I do not get system lockups/resets if I
force the clocks back (by reapplying).

I haven't been able to get any detailed logs yet, as I haven't switched on
debug parameters for amdgpu, but recently I was able to capture one occurrence
where the log ended with "ring timeout" and "GPU reset begin" before the
system hard reset.
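
To make such occurrences easier to spot in a saved kernel log, a small filter
along the following lines could be used. This is only a sketch: the two
patterns are based on the phrases quoted above, the exact wording of the
driver messages may differ between kernel versions, and the function name
find_gpu_events and the labels are hypothetical.

```python
import re
import sys

# Patterns based on the two phrases seen in the captured log; the exact
# message wording is an assumption and may vary between kernel versions.
PATTERNS = {
    "ring_timeout": re.compile(r"\bring\b.*\btimeout\b", re.IGNORECASE),
    "gpu_reset": re.compile(r"GPU reset begin", re.IGNORECASE),
}


def find_gpu_events(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    events = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                events.append((number, label, line.rstrip("\n")))
    return events


if __name__ == "__main__":
    # Usage: python scan_log.py kernel.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for number, label, text in find_gpu_events(handle):
                print(f"{number}: [{label}] {text}")
```

If the journal is persistent, the kernel messages of the previous boot can be
saved with "journalctl -k -b -1 > kernel.log" and passed to the script.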

I don't know where to start the investigation to find out what causes the
"safe clock" to trigger and, in case the card really is damaged, which CUs are
causing issues (and need to be disabled, as I just found out that I could
disable CUs using boot parameters). I'm not sure which debug parameters I can
use to get the information I need to look into the issue.
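
For reference, the boot parameter in question is amdgpu.disable_cu, which as
far as I understand takes a comma-separated list of se.sh.cu triples (shader
engine, shader array, compute unit). A minimal sketch for building that string
follows; the helper name build_disable_cu and the indices in the example are
purely hypothetical, since which CUs (if any) are faulty is not known.

```python
def build_disable_cu(cus):
    """Build an amdgpu.disable_cu boot parameter from (se, sh, cu) triples."""
    parts = []
    for se, sh, cu in cus:
        if min(se, sh, cu) < 0:
            raise ValueError(f"negative index in {(se, sh, cu)}")
        # Each CU is written as se.sh.cu; entries are joined with commas.
        parts.append(f"{se}.{sh}.{cu}")
    if not parts:
        raise ValueError("no CUs given")
    return "amdgpu.disable_cu=" + ",".join(parts)


if __name__ == "__main__":
    # Hypothetical indices, for illustration only.
    print(build_disable_cu([(0, 0, 3), (1, 0, 5)]))
```

The resulting string would be appended to the kernel command line in the
bootloader configuration.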

The current PSU installed on the system is an EVGA Supernova 750 P2 (750W 80+
Platinum) and I have both power connectors on the video card connected. The
power supply should be sufficient and shouldn't be a problem.

Overall, the experience with this card has raised a lot of questions that I
had previously neglected, especially regarding cooling, such as which kind of
thermal compound/pads to use and where and how to apply/place them. Personally,
though, I have never found cooling this hard to get right, even with some very
power-hungry CPUs I currently have.
