[Bug 206475] amdgpu under load drop signal to monitor until hard reset

bugzilla-daemon at bugzilla.kernel.org bugzilla-daemon at bugzilla.kernel.org
Wed Jun 24 20:33:24 UTC 2020


https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #15 from Andrew Ammerlaan (andrewammerlaan at riseup.net) ---
So today it was *really* hot, and I had this issue occur a couple of times.
(The solution with the extra fans was nice and all, but not enough to prevent
it entirely.)

However, now that the iGPU is the default, I can still see the system monitor
that I usually run on the other monitor when this issue occurs. Every single
time, the thermal sensor of the GPU showed a ridiculous value (e.g. 511 degrees
Celsius).

Now, this could explain why the GPU does a reset. If the thermal sensor
suddenly returned a value of e.g. 511, then of course the GPU would shut
itself down.

As it is clearly impossible for the temperature of the GPU to jump from
somewhere between 80 and 90 to over 500 within a couple of milliseconds, I
conclude that there is something wrong, either physically with the thermal
sensor, or with the way the firmware/driver handles the temperature reporting
from the sensor. Also, if the GPU had actually reached a temperature of 511,
it would be broken now, as the melting temperature of tin is about 230
degrees Celsius.
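
The reasoning above amounts to a plausibility check on consecutive readings,
which can be sketched in a few lines of Python. This is only an illustration:
the function name and both thresholds (50 degrees per second, 150 degrees
absolute) are made-up assumptions, not values taken from the amdgpu driver or
firmware.

```python
def is_plausible(prev_c, curr_c, dt_s, max_rate_c_per_s=50.0, max_abs_c=150.0):
    """Return True if curr_c is a believable follow-up reading to prev_c.

    prev_c, curr_c: consecutive temperature readings in degrees Celsius.
    dt_s: time between the two readings in seconds.
    The two limits are hypothetical and chosen only for illustration.
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    # Reject readings above anything the silicon could survive.
    if curr_c > max_abs_c:
        return False
    # Reject jumps that are faster than the assumed maximum heating rate.
    rate = abs(curr_c - prev_c) / dt_s
    return rate <= max_rate_c_per_s


# A normal fluctuation under load versus the jump described above.
print(is_plausible(85.0, 87.0, 0.1))     # True
print(is_plausible(85.0, 511.0, 0.005))  # False
```

A reading that fails such a check says more about the sensor or the reporting
path than about the actual die temperature.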

I happen to work with thermometers quite a lot, and I have seen temperature
readings do stuff like this. Usually the cause is either a broken or shorted
sensor (which is unlikely in this case, because it works normally most of the
time), or a wrong/incomplete calibration curve. (Usually thermal sensors are
only calibrated within the range they are expected to operate in, but the high
limit of this calibration curve might be too low.)
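
To illustrate what I mean by a calibration curve with a limited range, here is
a small Python sketch. The table values and names are entirely hypothetical
and have nothing to do with how the GPU's sensor is really calibrated; the
point is only that a raw value outside the calibrated range has no
trustworthy temperature attached to it.

```python
from bisect import bisect_right

# Hypothetical calibration table: (raw sensor count, temperature in Celsius).
CAL_POINTS = [(100, 0.0), (300, 50.0), (500, 100.0)]


def raw_to_celsius(raw):
    """Linearly interpolate inside the calibrated range, else return None."""
    raws = [r for r, _ in CAL_POINTS]
    # Outside the calibrated range the curve says nothing reliable.
    if raw < raws[0] or raw > raws[-1]:
        return None
    i = bisect_right(raws, raw)
    if i == len(raws):
        # raw sits exactly on the last calibration point.
        return CAL_POINTS[-1][1]
    (r0, t0), (r1, t1) = CAL_POINTS[i - 1], CAL_POINTS[i]
    return t0 + (t1 - t0) * (raw - r0) / (r1 - r0)


print(raw_to_celsius(400))  # 75.0
print(raw_to_celsius(600))  # None
```

If the conversion extrapolated instead of returning None here, a raw value
just past the top of the table could turn into a nonsensical temperature.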

Anyway, either the GPU reset is caused by the incorrect temperature readings,
or the incorrect temperature readings are caused by the GPU reset (which is
also possible, I guess). In any case, it would be great if AMD could look into
this soon, because clearly something is wrong.


