[Bug 93101] GPU Fault almost burned the CPU

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Wed Nov 25 02:00:11 PST 2015


https://bugs.freedesktop.org/show_bug.cgi?id=93101

            Bug ID: 93101
           Summary: GPU Fault almost burned the CPU
           Product: Mesa
           Version: git
          Hardware: Other
                OS: All
            Status: NEW
          Severity: normal
          Priority: medium
         Component: Drivers/Gallium/radeonsi
          Assignee: dri-devel at lists.freedesktop.org
          Reporter: dev at illwieckz.net
        QA Contact: dri-devel at lists.freedesktop.org

Created attachment 120103
  --> https://bugs.freedesktop.org/attachment.cgi?id=120103&action=edit
syslog (short)

Hi, this issue is about the fact that a GPU lockup can cause the CPU to
overheat dangerously (for real).

A few hours ago I got a GPU lockup while I was trying to play a DVD with VLC.
The video rendering wasn't functional (no picture), then the GPU started to
display garbage (see attached photo), then it locked up.

I've attached some logs: one very long syslog, and a short excerpt of it
(easier to read, but I included the original one in case I missed
something).

To summarize, the syslog contains lines like these:

```
Nov 24 22:58:18 gollum gnome-session[3720]: [00007f134c173c20] avcodec decoder: Using G3DVL VDPAU Driver Shared Library version 1.0 for hardware decoding
Nov 24 22:58:18 gollum kernel: [97035.599456] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00002126
Nov 24 22:58:18 gollum kernel: [97035.599460] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0408800C
Nov 24 22:58:18 gollum kernel: [97035.599465] VM fault (0x0c, vmid 2) at page 8486, read from 'TC4' (0x54433400) (136)
Nov 24 22:58:55 gollum kernel: [97072.747472] radeon 0000:01:00.0: ring 0 stalled for more than 10088msec
Nov 24 22:58:55 gollum kernel: [97072.747483] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000059fcff last fence id 0x000000000059fd12 on ring 0)
Nov 24 22:59:04 gollum kernel: [97081.259933] WARNING: CPU: 4 PID: 23502 at /home/kernel/COD/linux/drivers/gpu/drm/radeon/radeon_object.c:83 radeon_ttm_bo_destroy+0xe7/0xf0 [radeon]()
```
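
For readers who want to cross-check the raw values in the fault lines above,
here is a minimal Python sketch. The bit positions used for the status register
are an assumption, chosen only because they reproduce the values the kernel
printed in this log (0x0c, vmid 2, client id 136, "read"); they are not taken
from a register specification. The function name and the `fence_gap` variable
are hypothetical.

```python
def decode_vm_fault(addr: int, status: int, client_tag: int) -> dict:
    """Decode the raw fault values from the syslog excerpt.

    ASSUMPTION: the shifts and masks below are inferred from this one
    log, not from hardware documentation.
    """
    return {
        "page": addr,                          # 0x00002126 shown in decimal
        "protections": status & 0xFF,          # low byte, logged as 0x0c
        "client_id": (status >> 12) & 0xFF,    # logged as (136)
        "is_write": bool((status >> 24) & 1),  # 0 matches "read from"
        "vmid": (status >> 25) & 0xF,          # logged as "vmid 2"
        # The client tag is four ASCII bytes, big-endian, NUL-padded.
        "client": client_tag.to_bytes(4, "big").rstrip(b"\0").decode("ascii"),
    }


fault = decode_vm_fault(0x00002126, 0x0408800C, 0x54433400)

# Distance between the two fence ids printed in the "GPU lockup" line.
fence_gap = 0x59FD12 - 0x59FCFF
```

The decoded page and client name agree with the "VM fault" line, and the fence
ids in the lockup line differ by 0x13.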

My system is running:

vlc 3.0.0~~git20151123+r62463+34~ubuntu15.10.1
linux-image-4.3.0-040300-generic 4.3.0-040300.201511020949
libdrm-radeon1 2.4.65+git1511161830.8913cd~gd~w
xserver-xorg-video-radeon 7.6.99+git1511170732.10b7c3~gd~w
libgl1-mesa-dri 11.2~git1511231930.e4c122~gd~w
mesa-vdpau-drivers 11.2~git1511231930.e4c122~gd~w

That lockup is a real issue, but it's not the topic of this ticket.

The really big problem is that this bug almost burned my CPU. Let me explain.

When the bug occurred, I tried to track it down. Instead of rebooting my
computer, I started a laptop in order to connect to my computer over ssh and
diagnose the live system. While the laptop was booting, I took some photos of
my screen.

But suddenly, my computer shut itself down. The CPU critical temperature had
been reached.

The normal operating temperature on my system is between 30°C and 40°C. In
case of emergency, I have two regulators running on my computer. The first one
raises the fan speed from 128 RPM to 1400 RPM when the temperature reaches
50°C, and the second one downclocks all 8 cores from 4.7 GHz to 1.4 GHz when
the temperature reaches 70°C.

Both regulators are userspace regulators. The first is the well-known
fancontrol, and the other one is my own. Both work well (if I use cpuburn, for
example).
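
To make the intended policy concrete, here is a minimal Python sketch of the
two thresholds described above. It is not the code of fancontrol or of my own
regulator: the names `Targets` and `targets_for` are hypothetical, and the
plain step function is an assumption (a real tool may ramp the fan gradually).
It only computes the targets; it does not touch any hardware.

```python
from dataclasses import dataclass

# Values taken from the description above.
FAN_IDLE_RPM = 128
FAN_MAX_RPM = 1400
FAN_TRIGGER_C = 50.0
CPU_FULL_GHZ = 4.7
CPU_SAFE_GHZ = 1.4
THROTTLE_TRIGGER_C = 70.0


@dataclass(frozen=True)
class Targets:
    """What the two regulators should be asking for at a given temperature."""
    fan_rpm: int
    cpu_ghz: float


def targets_for(temp_c: float) -> Targets:
    """Return the fan speed and CPU clock expected at temp_c (step policy)."""
    fan = FAN_MAX_RPM if temp_c >= FAN_TRIGGER_C else FAN_IDLE_RPM
    freq = CPU_SAFE_GHZ if temp_c >= THROTTLE_TRIGGER_C else CPU_FULL_GHZ
    return Targets(fan, freq)
```

Under this policy, any temperature at or above 70°C should give a fan at
1400 RPM and cores at 1.4 GHz, for example `targets_for(90.0)`.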

The fact is, when the GPU lockup occurred, something in the driver went wrong
on the CPU side. It looks like some infinite loop started on my cores, doing
intensive work, probably without having to wait on external components (like
main memory), so nothing ever slowed the CPU down.

In fact, the computer acted exactly as if I were running one cpuburn process
per core with the performance CPU governor at noon in summer. But there was one
difference: the fan never sped up (it was still running at 128 RPM when the
CPU reached 90°C), and the CPU was never downclocked either.

That's why I filed this issue. When this bug occurred, the system went so wrong
that the CPU was brought to its knees and no regulator was able to control the
CPU fan, so the CPU kept heating up endlessly.

Fortunately, the internal CPU temperature protection automatically shut down
my computer to save itself. But if someone uses a CPU with a faulty
temperature safety mechanism, this GPU lockup can lead to a CPU burn for real!
