[Bug 81001] New: radeon: fence wait failed (-35) after hybrid suspend, leading to GPU reset and hangs

Wed Jul 23 05:06:35 PDT 2014

https://bugzilla.kernel.org/show_bug.cgi?id=81001

            Bug ID: 81001
           Summary: radeon: fence wait failed (-35) after hybrid suspend,
                    leading to GPU reset and hangs
           Product: Drivers
           Version: 2.5
    Kernel Version: 3.15+
          Hardware: i386
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri at kernel-bugs.osdl.org
          Reporter: iive at yahoo.com
        Regression: No

Created attachment 144041
  --> https://bugzilla.kernel.org/attachment.cgi?id=144041&action=edit
dmesg of suspend/resume session with radeon.ko dpm debug turned on

My hardware is Radeon HD5670 (Redwood).
To reproduce the problem, boot a vanilla 3.15.x kernel and run in KMS mode (no
Xorg server needed). Then do a hybrid suspend (to both RAM and disk) with the
following command:

`echo suspend  > /sys/power/disk; echo disk > /sys/power/state`

Resume with the power button. `dmesg` then shows:

[   83.997399] [drm] ring test on 5 succeeded in 1 usecs
[   83.997403] [drm] UVD initialized successfully.
[   83.997450] [drm] ib test on ring 0 succeeded in 0 usecs
[   83.997494] [drm] ib test on ring 3 succeeded in 1 usecs
[   94.137259] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[   94.137263] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
[   94.137265] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[   94.137268] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
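
A quick way to check whether a resume hit this failure is to search the kernel log for the error string shown above (-35 is -EDEADLK in the generic Linux errno numbering). The following is only a minimal sketch: the function name `uvd_fence_wait_failed` is made up for illustration, and it reads the log from stdin so it can be used with `dmesg` or with a saved log file.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if the kernel log on stdin contains the
# fence wait failure quoted in this report.
# uvd_fence_wait_failed is a hypothetical helper name, not an existing tool.
uvd_fence_wait_failed() {
    # -F: match the text literally, -q: no output, only an exit status
    grep -qF 'radeon: fence wait failed (-35)'
}

# Usage after resume (dmesg may need root on some systems):
#   dmesg | uvd_fence_wait_failed && echo "fence wait failure found"
```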

At this point, if an Xorg server is started, any attempt at VDPAU hardware
decoding fails.
If the computer is left running without a reboot, a timeout eventually triggers
and a GPU reset may be attempted, which usually hangs the system. (The
following log is from an older GPU reset that probably succeeded.)

[    0.000000] Linux version 3.15.2 (root) (gcc version 4.8.3 (GCC) ) #2 SMP
[12398.387691] radeon 0000:01:00.0: ring 5 stalled for more than 242796msec
[12398.387699] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000005 last fence id 0x0000000000000004 on ring 5)
[12398.425151] radeon 0000:01:00.0: Saved 23 dwords of commands on ring 0.
[12398.425167] radeon 0000:01:00.0: GPU softreset: 0x00000009
[12398.425169] radeon 0000:01:00.0:   GRBM_STATUS               = 0xF5703828
[12398.425171] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xFC000007
[12398.425173] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[12398.425175] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200800C0
[12398.425177] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[12398.425179] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[12398.425181] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
[12398.425183] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008004
[12398.425185] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80228647
[12398.425187] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[12398.441572] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[12398.441626] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[12398.442782] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[12398.442783] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
[12398.442785] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[12398.442787] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200800C0
[12398.442789] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[12398.442791] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[12398.442793] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[12398.442795] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[12398.442796] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[12398.442798] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[12398.442813] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[12398.512161] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[12398.516583] [drm] PCIE GART of 1024M enabled (table at 0x000000000025D000).

The bug does not occur if only suspend to RAM or only suspend to disk is used.

I tried to bisect, but a number of rc kernels hang for me long before the
radeon module is loaded, so a bisect is not feasible at this point.

I suspected that the new async suspend/resume code might be at fault, as I was
seeing the video card being turned off (the monitor going off) and then turned
back on a moment before the machine shut down completely.

So I found the async suspend commits (from an article) and reverted them.
Reverting 200421a80f6e0a9e39d698944cc35cba103eb6ce and
3c31b52f96f7b559d950b16113c0f68c72a1985e seems to avoid the effect described
above (the monitor turning off, on, then off again), but it does not fix this
bug.

Reverting
7cd0602d7836c0056fe9bdab014d5ac5ec5cb291,
92858c476ec4e99cf0425f05dee109b6a55eb6f8,
9e5e7910df824ba02aedd2b5d2ca556426ea6d0b,
76569faa62c46382e080c3e190c66e19515aae1c,
de377b3972729f00ee236ae4a97393e282ffe391,
28b6fd6e37792b16a56d324841bdb20ab78e4522 and
a59ffb2062df3a5c346dbed931fa1e587fd0f0f3
doesn't affect the bug either, so I assume that this bug is not related to the
async suspend/resume changes.
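
For anyone repeating the revert experiments, the sketch below prints one `git revert` command per commit hash so the list can be reviewed before anything is run. It is an illustration only: `revert_cmds` is a hypothetical helper name, and it assumes the hashes are passed in an order in which the reverts apply cleanly (usually newest first), which this report does not specify.

```shell
#!/bin/sh
# Sketch: print (dry run) the revert command for each commit hash given as an
# argument. revert_cmds is a hypothetical helper name.
revert_cmds() {
    for commit in "$@"; do
        # --no-edit keeps the default revert message without opening an editor
        printf 'git revert --no-edit %s\n' "$commit"
    done
}

# Usage, from inside a kernel git tree (order of hashes is an assumption):
#   revert_cmds 3c31b52f96f7b559d950b16113c0f68c72a1985e \
#               200421a80f6e0a9e39d698944cc35cba103eb6ce
# Pipe the output to `sh` to actually perform the reverts.
```

Printing the commands first makes it easy to reorder the hashes if a revert conflicts.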

If you cannot reproduce the problem, please advise me which commits to revert.

This bug report is a copy of https://bugs.freedesktop.org/show_bug.cgi?id=81620
