[Mesa-dev] [Bug 111080] Random crash on amdgpu due to temperature misreporting

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Sun Jul 7 15:53:57 UTC 2019


https://bugs.freedesktop.org/show_bug.cgi?id=111080

            Bug ID: 111080
           Summary: Random crash on amdgpu due to temperature
                    misreporting
           Product: Mesa
           Version: unspecified
          Hardware: x86-64 (AMD64)
                OS: Linux (All)
            Status: NEW
          Severity: major
          Priority: medium
         Component: Mesa core
          Assignee: mesa-dev at lists.freedesktop.org
          Reporter: timitch_1 at yahoo.com
        QA Contact: mesa-dev at lists.freedesktop.org

Created attachment 144716
  --> https://bugs.freedesktop.org/attachment.cgi?id=144716&action=edit
amdgpu_pm_info information from start of game to crash

Hi, 

I have been experiencing random crashes in Dota 2 for the past two years.
In that time I have changed everything in the computer: 6900K -> Threadripper,
Corsair memory -> G.Skill, Radeon Frontier -> Radeon VII, and Ubuntu 16.04 ->
16.10 -> 17.04 -> 17.10 -> 18.04 -> 18.10 -> 19.04, along with every Mesa
version in between. I am currently on
"OpenGL renderer string: AMD Radeon VII (VEGA20, DRM 3.32.0, 5.2.0-rc7+, LLVM
9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel -
padoka PPA
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
"
All of these configurations show the same random crash.

I finally got a lead on the problem when I saw the GPU reporting unrealistic
values, e.g. the clock jumping into the 10,000 MHz range. Around the time of
the crash, the temperature in the logs goes from 62 C to about 500 C and back
to 62 C within two seconds. I suspect this causes the GPU to trigger its
protection and freeze. If the reading were real, it would also violate some
laws of physics.
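
The plausibility argument above can be expressed as a simple check on
consecutive readings. This is only a minimal sketch, not part of any driver or
tool: it assumes readings taken about one second apart, and the 30 C per-sample
limit is an arbitrary illustrative threshold, not a hardware specification.

```python
def find_implausible_jumps(temps_c, max_step_c=30):
    """Return (index, previous, current) for every reading that differs
    from the preceding one by more than max_step_c degrees.

    max_step_c is a hypothetical threshold chosen for illustration only.
    """
    jumps = []
    for i in range(1, len(temps_c)):
        prev, cur = temps_c[i - 1], temps_c[i]
        if abs(cur - prev) > max_step_c:
            jumps.append((i, prev, cur))
    return jumps


if __name__ == "__main__":
    # Values in the spirit of the description: stable, one spike, recovery.
    for index, prev, cur in find_implausible_jumps([62, 62, 500, 62]):
        print(f"sample {index}: {prev} C -> {cur} C")
```

Both the rise and the fall of a spike are reported, so a single bad sample
shows up as two consecutive entries.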

Most other tools I use to test the graphics card, for example Unigine, report
correct values within the supported range defined for the card, which is:
"
#OD_VDDC_CURVE:
#0:        808Mhz        704mV
#1:       1304Mhz        777mV
#2:       1801Mhz       1054mV
#OD_RANGE:
#SCLK:     808Mhz       2200Mhz
#MCLK:     351Mhz       1200Mhz
"
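
To compare logged values against these limits programmatically, the block
above can be parsed into numeric ranges. The following is a rough sketch that
assumes the text looks exactly like the excerpt quoted here (including the
leading "#"); the function name parse_od_range is made up for this example.

```python
import re

OD_TEXT = """\
#OD_VDDC_CURVE:
#0:        808Mhz        704mV
#1:       1304Mhz        777mV
#2:       1801Mhz       1054mV
#OD_RANGE:
#SCLK:     808Mhz       2200Mhz
#MCLK:     351Mhz       1200Mhz
"""

# Matches e.g. "#SCLK:     808Mhz       2200Mhz" (the leading '#' is optional).
RANGE_LINE = re.compile(r"^#?\s*(\w+):\s+(\d+)Mhz\s+(\d+)Mhz\s*$", re.IGNORECASE)


def parse_od_range(text):
    """Return {name: (min_mhz, max_mhz)} for the lines under OD_RANGE."""
    ranges = {}
    in_range = False
    for line in text.splitlines():
        stripped = line.strip().lstrip("#").strip()
        # A bare "NAME:" line starts a new section.
        if stripped.endswith(":") and " " not in stripped:
            in_range = stripped == "OD_RANGE:"
            continue
        if in_range:
            match = RANGE_LINE.match(line.strip())
            if match:
                ranges[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return ranges


if __name__ == "__main__":
    print(parse_od_range(OD_TEXT))
```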

Attached is an example generated with 
"watch -t -n1 'cat /sys/kernel/debug/dri/1/amdgpu_pm_info|grep -A 9 "GFX
Clocks" | tee -a /home/mitch/tmp/gpulog.txt'"

Example grep Temp
"
GPU Temperature: 70 C
GPU Temperature: 511 C
GPU Temperature: 69 C
"

grep \(SCLK
"
        1924 MHz (SCLK)
        5422 MHz (SCLK)
        1999 MHz (SCLK)
"
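
The same filtering can be done over the whole log instead of eyeballing grep
output. This is a minimal sketch under stated assumptions: the line formats
are taken from the excerpts above, the SCLK upper limit comes from the OD_RANGE
quoted earlier, and the 120 C temperature ceiling is a hypothetical cut-off
picked for illustration, not a documented limit of the card. Only values above
the upper limits are flagged, since clocks below the overdrive minimum may be
legitimate.

```python
import re

TEMP_RE = re.compile(r"GPU Temperature:\s*(\d+)\s*C")
SCLK_RE = re.compile(r"(\d+)\s*MHz\s*\(SCLK\)")

SCLK_MAX_MHZ = 2200  # upper SCLK bound from the OD_RANGE excerpt
TEMP_MAX_C = 120     # hypothetical ceiling, an assumption for this sketch


def flag_out_of_range(lines):
    """Return (line_number, kind, value) for readings above the limits."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        temp = TEMP_RE.search(line)
        if temp and int(temp.group(1)) > TEMP_MAX_C:
            flagged.append((number, "temperature", int(temp.group(1))))
        sclk = SCLK_RE.search(line)
        if sclk and int(sclk.group(1)) > SCLK_MAX_MHZ:
            flagged.append((number, "sclk", int(sclk.group(1))))
    return flagged


if __name__ == "__main__":
    sample = [
        "GPU Temperature: 70 C",
        "GPU Temperature: 511 C",
        "GPU Temperature: 69 C",
        "        1924 MHz (SCLK)",
        "        5422 MHz (SCLK)",
        "        1999 MHz (SCLK)",
    ]
    for entry in flag_out_of_range(sample):
        print(entry)
```

To run it against a real capture, pass the lines of the log file (for example
`open(path).read().splitlines()`) instead of the inline sample.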



I realize the issue might lie somewhere other than the Mesa driver, but I would
like to know where it could be and whether anybody else has seen this kind of
behaviour.

Thank you very much for any help
