Regression in 6.6: trying to set DPMS mode kills radeon (r600)

Holger Hoffstätte holger at applied-asynchrony.com
Sat Dec 16 17:36:57 UTC 2023


Hi,

since the last regression in 6.6.x was so popular I thought I'd post another
one. :)

The good news first: the symptoms described below do *not* reproduce with
6.7-rc5-git as of today (c8e97fc6b4c0), and there's an acceptable workaround
for 6.6.x as well.

The affected machine is an older SandyBridge dektop with a fanless
r600 Redwood GPU, using the radeon driver. "Recently" - some time
after the last few 6.6.x stable updates - it started to die with GPU
lockups. I first blamed this on standby/resume - because why not? - but
this turned out to be wrong; the real culprit is DPMS.

I use xfce-power-manager as "screensaver" to turn off the display after
inacitvity. This can be configured in two ways: "suspend" and "poweroff".
I've been using "poweroff" since forever without problems, until now.

The symptom is that everything works fine until the screensaver kicks in
and tries to turn the monitor off, which sends the radeon driver and the GPU
into a complete tailspin.

Here is the GPU during boot - all good:

Dec 12 12:47:54 ragnarok kernel: [drm] radeon kernel modesetting enabled.
Dec 12 12:47:54 ragnarok kernel: Console: switching to colour dummy device 80x25
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: vgaarb: deactivate vga console
Dec 12 12:47:54 ragnarok kernel: [drm] initializing kernel modesetting (REDWOOD 0x1002:0x68DA 0x1043:0x0376 0x00).
Dec 12 12:47:54 ragnarok kernel: ATOM BIOS: 68DA.12.19.0.3.AS01
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
Dec 12 12:47:54 ragnarok kernel: [drm] Detected VRAM RAM=1024M, BAR=256M
Dec 12 12:47:54 ragnarok kernel: [drm] RAM width 128bits DDR
Dec 12 12:47:54 ragnarok kernel: [drm] radeon: 1024M of VRAM memory ready
Dec 12 12:47:54 ragnarok kernel: [drm] radeon: 1024M of GTT memory ready.
Dec 12 12:47:54 ragnarok kernel: [drm] Loading REDWOOD Microcode
Dec 12 12:47:54 ragnarok kernel: [drm] Internal thermal controller without fan control
Dec 12 12:47:54 ragnarok kernel: [drm] radeon: dpm initialized
Dec 12 12:47:54 ragnarok kernel: [drm] GART: num cpu pages 262144, num gpu pages 262144
Dec 12 12:47:54 ragnarok kernel: [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
Dec 12 12:47:54 ragnarok kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000000014C000).
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: WB enabled
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: radeon: MSI limited to 32-bit
Dec 12 12:47:54 ragnarok kernel: radeon 0000:01:00.0: radeon: using MSI.
Dec 12 12:47:54 ragnarok kernel: [drm] radeon: irq initialized.
Dec 12 12:47:54 ragnarok kernel: [drm] ring test on 0 succeeded in 1 usecs
Dec 12 12:47:54 ragnarok kernel: [drm] ring test on 3 succeeded in 2 usecs
Dec 12 12:47:54 ragnarok kernel: [drm] ring test on 5 succeeded in 1 usecs
Dec 12 12:47:54 ragnarok kernel: [drm] UVD initialized successfully.
Dec 12 12:47:54 ragnarok kernel: [drm] ib test on ring 0 succeeded in 0 usecs
Dec 12 12:47:54 ragnarok kernel: [drm] ib test on ring 3 succeeded in 0 usecs
Dec 12 12:47:55 ragnarok kernel: [drm] ib test on ring 5 succeeded
Dec 12 12:47:55 ragnarok kernel: [drm] Radeon Display Connectors
Dec 12 12:47:55 ragnarok kernel: [drm] Connector 0:
Dec 12 12:47:55 ragnarok kernel: [drm]   HDMI-A-1
Dec 12 12:47:55 ragnarok kernel: [drm]   HPD2
Dec 12 12:47:55 ragnarok kernel: [drm]   DDC: 0x6450 0x6450 0x6454 0x6454 0x6458 0x6458 0x645c 0x645c
Dec 12 12:47:55 ragnarok kernel: [drm]   Encoders:
Dec 12 12:47:55 ragnarok kernel: [drm]     DFP1: INTERNAL_UNIPHY2
Dec 12 12:47:55 ragnarok kernel: [drm] Connector 1:
Dec 12 12:47:55 ragnarok kernel: [drm]   DVI-I-1
Dec 12 12:47:55 ragnarok kernel: [drm]   HPD1
Dec 12 12:47:55 ragnarok kernel: [drm]   DDC: 0x6460 0x6460 0x6464 0x6464 0x6468 0x6468 0x646c 0x646c
Dec 12 12:47:55 ragnarok kernel: [drm]   Encoders:
Dec 12 12:47:55 ragnarok kernel: [drm]     DFP2: INTERNAL_UNIPHY1
Dec 12 12:47:55 ragnarok kernel: [drm]     CRT1: INTERNAL_KLDSCP_DAC1
Dec 12 12:47:55 ragnarok kernel: [drm] Connector 2:
Dec 12 12:47:55 ragnarok kernel: [drm]   VGA-1
Dec 12 12:47:55 ragnarok kernel: [drm]   DDC: 0x6470 0x6470 0x6474 0x6474 0x6478 0x6478 0x647c 0x647c
Dec 12 12:47:55 ragnarok kernel: [drm]   Encoders:
Dec 12 12:47:55 ragnarok kernel: [drm]     CRT2: INTERNAL_KLDSCP_DAC2
Dec 12 12:47:55 ragnarok kernel: [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 0
Dec 12 12:47:55 ragnarok kernel: [drm] fb mappable at 0xE034D000
Dec 12 12:47:55 ragnarok kernel: [drm] vram apper at 0xE0000000
Dec 12 12:47:55 ragnarok kernel: [drm] size 9216000
Dec 12 12:47:55 ragnarok kernel: [drm] fb depth is 24
Dec 12 12:47:55 ragnarok kernel: [drm]    pitch is 7680
Dec 12 12:47:55 ragnarok kernel: fbcon: radeondrmfb (fb0) is primary device
Dec 12 12:47:55 ragnarok kernel: Console: switching to colour frame buffer device 240x75
Dec 12 12:47:55 ragnarok kernel: radeon 0000:01:00.0: [drm] fb0: radeondrmfb frame buffer device

Eventually the screensaver tries to switch off the monitor via DPMS "poweroff" method and
this greatly upsets the GPU:

Dec 12 20:39:59 ragnarok kernel: radeon 0000:01:00.0: ring 0 stalled for more than 10140msec
Dec 12 20:39:59 ragnarok kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000002 last fence id 0x0000000000000003 on ring 0)

The above repeats many times until:

Dec 12 20:40:18 ragnarok kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000002 last fence id 0x0000000000000003 on ring 0)
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: Saved 55 dwords of commands on ring 0.
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: GPU softreset: 0x00000008
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0xA0003828
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00010100
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00020180
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80038042
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00004001
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   SRBM_STATUS               = 0x200000C0
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: GPU reset succeeded, trying to resume
Dec 12 20:40:19 ragnarok kernel: [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
Dec 12 20:40:19 ragnarok kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000000014C000).
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: WB enabled
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
Dec 12 20:40:19 ragnarok kernel: radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x000000000005c418
Dec 12 20:40:19 ragnarok kernel: debugfs: File 'radeon_ring_gfx' in directory '0' already present!
Dec 12 20:40:19 ragnarok kernel: debugfs: File 'radeon_ring_dma1' in directory '0' already present!
Dec 12 20:40:19 ragnarok kernel: [drm:r600_ring_test [radeon]] *ERROR* radeon: ring 0 test failed (scratch(0x8504)=0xCAFEDEAD)
Dec 12 20:40:19 ragnarok kernel: [drm:evergreen_resume [radeon]] *ERROR* evergreen startup failed on resume
Dec 12 20:40:29 ragnarok kernel: radeon 0000:01:00.0: ring 0 stalled for more than 10077msec
Dec 12 20:40:29 ragnarok kernel: radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000002 last fence id 0x0000000000000003 on ring 0)

..at which point I have to poweroff the entire machine to reset the GPU; a simple
reboot does not help.

It took me a while to find the culprit (the DPMS poweroff method), and for now
my workaround is to switch xfce-power-manager to use the "suspend" method instead.
This has been working reliably for several days now, with many system suspend/resume
and display suspend/wakeup cycles.

I *think* this started to happen with 6.6.5, which had a lot of drm-amd-display
patches (see[1]), so before I start bisecting and whatnot I was wondering if anybody
has any suggestions what might be the cause here. I can gladly test additional
patches or reverts. Even if 6.7-rc seems to be OK again I still think it's worthwile
to fix this in 6.6 since it's an LTS kernel.

Thanks for reading!
Holger

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/queue-6.6?id=3601ed703d11a9140852b279482666d6f53e1c06


More information about the amd-gfx mailing list