[Bug 93341] Semi-random GPU lockups on radeonsi with a RadeonHD 7770 (when playing videos, running OpenGL games, WebGL apps, or after extended periods of time)

Tue Apr 11 22:20:05 UTC 2017

https://bugs.freedesktop.org/show_bug.cgi?id=93341

--- Comment #26 from Jean-François Fortin Tam <nekohayo at gmail.com> ---
OK, I've got good news... Julien, thanks to the crazy furry donut "torture
test" you suggested, I was able to finally pinpoint the real trigger for this
bug.

My understanding is that on Radeons (well, at least the Radeon HD 7770), there
is an emergency mechanism in the hardware (or maybe the firmware/microcode) that
makes the GPU throttle its own performance when it reaches a critical
temperature. Normally, the video driver is supposed to handle this state change
gracefully. However, the radeonsi/radeon/amdgpu driver on Linux does not, so the
driver goes belly up and the kernel panics.

During additional testing today, where I forced my GPU to overheat, I was able
to determine that the critical point is the same as on Windows: 113 degrees
Celsius. As soon as you go over 112... boom, dead radeonsi driver + kernel oops
(with the same error messages as my previous logs above). Additionally,
lm_sensors thinks the temperature has instantly jumped to 511 degrees Celsius
(!), and the readings stay stuck at 511 Celsius.
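
A minimal sketch of how such readings could be logged during a torture test,
assuming the card exposes its sensor through the standard hwmon sysfs
interface (temp1_input, in millidegrees Celsius). The 113-degree limit and the
stuck 511-degree value are the ones observed in this report; the function names
and the glob pattern are illustrative assumptions, not part of any driver or of
lm_sensors.

```python
import glob

CRITICAL_C = 113  # critical temperature observed in this report
STUCK_C = 511     # bogus value the sensor reported after the crash


def classify_reading(millidegrees):
    """Classify a raw hwmon reading given in millidegrees Celsius."""
    celsius = millidegrees / 1000
    if celsius >= STUCK_C:
        return "bogus"      # sensor is stuck, the reading is not a real temperature
    if celsius >= CRITICAL_C:
        return "critical"   # at or above the point where the lockup was seen
    return "ok"


def read_hwmon_temps(pattern="/sys/class/hwmon/hwmon*/temp1_input"):
    """Return {path: millidegrees} for every sensor file that can be read.

    The default pattern is an assumption; the hwmon index of the GPU
    differs from machine to machine.
    """
    readings = {}
    for path in glob.glob(pattern):
        try:
            with open(path) as sensor_file:
                readings[path] = int(sensor_file.read().strip())
        except (OSError, ValueError):
            continue  # skip sensors that vanish or return garbage
    return readings


if __name__ == "__main__":
    for path, raw in read_hwmon_temps().items():
        print(path, raw / 1000, classify_reading(raw))
```

Running this in a loop while the benchmark is active would show whether the
last sane reading before a lockup sits just under the critical value.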

"Duh! Just get better cooling!" might sound like a workaround (just like
keeping the case open), but nope, technically, it's still a software/driver
issue: the Linux driver should handle such scenarios just as gracefully as
the Windows driver does. On Windows, breaching the 110-113 degrees Celsius limit
simply results in the video driver dropping frames massively and continuing to
function at reduced performance (i.e. going from 40-60 fps to 10-15 fps on one
of my benchmarks). The system never crashes.

So the bug here, as I understand it, is that the radeonsi driver on Linux does
not handle the event where the hardware force-throttles itself.

---------
Contextual notes:
The reason I only started experiencing this issue in December 2015 (even though
I've had the GPU since 2012) is that I changed my PC case then, which means
different airflow and cooling behavior. And the reason it was so hard to get
consistent crashes is that I was sometimes troubleshooting with the case
closed and sometimes with the case open (when trying a different power supply
unit using a "siamese transplant" across another computer, for example). If I
keep my case open, the card never reaches the critical temperature, so the
issue does not happen. I might get a system "freeze" (possibly saying "*ERROR*
si_restrict_performance_levels_before_switch failed") after many hours of
torture testing, but the symptoms are different (the screen does not turn off,
the image stays on with everything frozen, and nothing else shows up in the
logs), so I presume that to be a different issue.
