[Bug 206475] amdgpu under load drop signal to monitor until hard reset

Thu Jun 25 09:58:46 UTC 2020

https://bugzilla.kernel.org/show_bug.cgi?id=206475

--- Comment #17 from Andrew Ammerlaan (andrewammerlaan at riseup.net) ---
(In reply to Alex Deucher from comment #16)
> When the GPU is in reset all reads to the MMIO BAR return 1s so you are just
> getting all ones until the reset succeeds.  511 is just all ones.  This
> patch will fix that issue:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9271dfd9e0f79e2969dcbe28568bce0fdc4f8f73

Well, there goes my hypothesis of the broken thermal sensor xD.
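
To illustrate why an all-ones read shows up as 511: if the temperature is taken
from a 9-bit field of a register, and the register reads back as all ones, the
field decodes to 2^9 - 1 = 511. This is only a sketch; the 9-bit width, the
shift, and the helper name are assumptions for illustration, not the actual
amdgpu register layout.

```python
# Sketch only: FIELD_BITS and FIELD_SHIFT are assumed values for illustration,
# not taken from the amdgpu driver source.
ALL_ONES_32 = 0xFFFFFFFF  # what a 32-bit MMIO read returns while the GPU is in reset
FIELD_BITS = 9            # assumed width of the temperature field
FIELD_SHIFT = 0           # assumed position of the field in the register


def extract_field(raw: int, shift: int, bits: int) -> int:
    """Return the `bits`-wide field that starts at bit `shift` of `raw`."""
    mask = (1 << bits) - 1
    return (raw >> shift) & mask


# Any 9-bit field of an all-ones value decodes to 511, whatever the shift.
print(extract_field(ALL_ONES_32, FIELD_SHIFT, FIELD_BITS))
```

So a reading of 511 says nothing about the sensor itself, only that the read
came back as all ones.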

I did discover yesterday that the fan of my GPU spins relatively slowly under
high load. When the GPU reached ~80 degrees Celsius, the fan didn't even spin
at half the maximum RPM! I used the pwmconfig script and the fancontrol service
from lm_sensors to force the fan to go to the maximum RPM just before reaching
80 degrees Celsius. It's very noisy, *but* the GPU stays well below 70 degrees
Celsius now, even under heavy load. As this issue seems to occur only when the
GPU is hotter than ~75 degrees Celsius, I'm hoping that this will help in
preventing the problem.
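
For anyone who wants to picture the kind of fan curve described above, here is
a rough sketch of a linear temperature-to-PWM mapping that reaches full speed
just before 80 degrees Celsius. It is not the fancontrol script itself and not
my actual configuration; the function name and all threshold values are made
up for illustration. The 0-255 PWM range is the usual hwmon convention.

```python
# Sketch only: all thresholds below are hypothetical example values.
def pwm_for_temp(temp_c: float,
                 min_temp: float = 40.0,   # assumed: below this, run at min_pwm
                 max_temp: float = 75.0,   # assumed: at or above this, run at max_pwm
                 min_pwm: int = 60,        # assumed lowest duty cycle to use
                 max_pwm: int = 255) -> int:
    """Map a temperature in degrees Celsius to a PWM duty value (0-255)."""
    if temp_c <= min_temp:
        return min_pwm
    if temp_c >= max_temp:
        return max_pwm
    # Linear interpolation between the two thresholds.
    frac = (temp_c - min_temp) / (max_temp - min_temp)
    return round(min_pwm + frac * (max_pwm - min_pwm))


for t in (35, 50, 65, 75, 80):
    print(t, pwm_for_temp(t))
```

With these example values the fan is already at full duty at 75 degrees, which
matches the idea of hitting maximum RPM before the GPU gets to 80.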

I'm still confused as to why this is necessary at all. The critical temperature
is 91 degrees Celsius, so why do I encounter these issues at ~80?
