[Mesa-dev] [Bug 111080] Random crash on amdgpu due to temperature misreporting

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Jul 7 15:53:57 UTC 2019


            Bug ID: 111080
           Summary: Random crash on amdgpu due to temperature
           Product: Mesa
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: major
          Priority: medium
         Component: Mesa core
          Assignee: mesa-dev at lists.freedesktop.org
          Reporter: timitch_1 at yahoo.com
        QA Contact: mesa-dev at lists.freedesktop.org

Created attachment 144716
  --> https://bugs.freedesktop.org/attachment.cgi?id=144716&action=edit
amdgpu_pm_info information from start of game to crash


I have been experiencing random crashes in Dota 2 for the past two years.
I have changed everything in the computer: 6900K -> Threadripper, Corsair
memory -> G.Skill, Radeon Frontier -> Radeon VII; Ubuntu 16.04 -> 16.10 ->
17.04 -> 17.10 -> 18.04 -> 18.10 -> 19.04. This was with all the Mesa versions
in between; I am currently on
"OpenGL renderer string: AMD Radeon VII (VEGA20, DRM 3.32.0, 5.2.0-rc7+, LLVM
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel -
padoka PPA
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
All of these configurations experience the same random crash.

I finally got a lead on the problem when I saw the GPU reporting unrealistic
values, e.g. the clock jumping into the 10,000 MHz range. Around the time of
the crash, the temperature in the logs goes from 62 C to 500 C and back to
62 C within two seconds. I suspect this causes the GPU to apply its protection
and freeze; if the reading were true, it would also violate some law of
physics.

Most other tools I use to test the graphics card, for example Unigine, report
correct values within the supported range defined for the card, which is:

#0:        808Mhz        704mV
#1:       1304Mhz        777mV
#2:       1801Mhz       1054mV
#SCLK:     808Mhz       2200Mhz
#MCLK:     351Mhz       1200Mhz

Attached is an example generated with:

watch -t -n1 'cat /sys/kernel/debug/dri/1/amdgpu_pm_info | grep -A 9 "GFX Clocks" | tee -a /home/mitch/tmp/gpulog.txt'

Example grep Temp
GPU Temperature: 70 C
GPU Temperature: 511 C
GPU Temperature: 69 C

grep \(SCLK
        1924 MHz (SCLK)
        5422 MHz (SCLK)
        1999 MHz (SCLK)
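
For anyone who wants to scan a similar log for these spikes, here is a minimal
sketch in Python. It assumes the log contains lines in the two formats shown
above ("GPU Temperature: N C" and "N MHz (SCLK)"). The SCLK ceiling is the
2200 MHz upper bound quoted earlier for this card; the temperature ceiling of
125 C is only a guess for illustration, not a value taken from the driver. The
function name find_implausible is hypothetical.

```python
import re

# Assumed plausibility limits (not read from the driver):
MAX_SCLK_MHZ = 2200  # upper SCLK bound quoted in the report for this card
MAX_TEMP_C = 125     # hypothetical cutoff, adjust for your hardware

TEMP_RE = re.compile(r"GPU Temperature:\s*(\d+)\s*C")
SCLK_RE = re.compile(r"(\d+)\s*MHz\s*\(SCLK\)")


def find_implausible(lines):
    """Return (line_number, kind, value) for readings above the assumed limits."""
    hits = []
    for number, line in enumerate(lines, start=1):
        temp = TEMP_RE.search(line)
        if temp and int(temp.group(1)) > MAX_TEMP_C:
            hits.append((number, "temperature", int(temp.group(1))))
        sclk = SCLK_RE.search(line)
        if sclk and int(sclk.group(1)) > MAX_SCLK_MHZ:
            hits.append((number, "sclk", int(sclk.group(1))))
    return hits


# The sample readings quoted in this report.
sample = [
    "GPU Temperature: 70 C",
    "GPU Temperature: 511 C",
    "GPU Temperature: 69 C",
    "        1924 MHz (SCLK)",
    "        5422 MHz (SCLK)",
    "        1999 MHz (SCLK)",
]

for number, kind, value in find_implausible(sample):
    print(f"line {number}: implausible {kind} reading: {value}")
```

On the sample readings this flags only the 511 C and 5422 MHz lines. To check a
full log, pass the open file object to find_implausible instead of the sample
list.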

I realize the issue might lie somewhere other than the Mesa driver, but I would
like to know where it could be and whether anybody else has seen this kind of
behaviour.

Thank you very much for any help
