[Bug 81620] New: radeon: fence wait failed (-35) after hybrid suspend on 3.15

Mon Jul 21 13:02:39 PDT 2014

https://bugs.freedesktop.org/show_bug.cgi?id=81620

          Priority: medium
            Bug ID: 81620
          Assignee: dri-devel at lists.freedesktop.org
           Summary: radeon: fence wait failed (-35) after hybrid suspend
                    on 3.15
          Severity: normal
    Classification: Unclassified
                OS: Linux (All)
          Reporter: iive at yahoo.com
          Hardware: x86 (IA32)
            Status: NEW
           Version: unspecified
         Component: DRM/Radeon
           Product: DRI

Created attachment 103217
  --> https://bugs.freedesktop.org/attachment.cgi?id=103217&action=edit
dmesg of suspend/resume session with radeon.ko dpm debug turned on

My hardware is a Radeon HD 5670 (Redwood).
To reproduce the problem, boot a vanilla 3.15.x kernel and run in KMS mode (no
Xorg server needed). Then do a hybrid suspend (suspend to both RAM and disk)
with the following command:

`echo suspend  > /sys/power/disk; echo disk > /sys/power/state`
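
Before writing to these files it may help to confirm that the kernel offers
the hybrid mode at all. The sketch below assumes that reading /sys/power/disk
returns the available modes on one line, with the active one in brackets. The
function name supports_hybrid_suspend is hypothetical and not part of any
tool.

```shell
# Hypothetical helper: succeed if the "suspend" (hybrid) mode is listed.
# $1 is the mode file to inspect; it defaults to /sys/power/disk.
supports_hybrid_suspend() {
    # -w matches "suspend" as a whole word, so "[suspend]" also counts.
    grep -qw 'suspend' "${1:-/sys/power/disk}"
}

# Example use (reads the file only, needs no root):
#   supports_hybrid_suspend && echo "hybrid suspend mode is listed"
```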

Resume with the power button. `dmesg` then shows:

[   83.997399] [drm] ring test on 5 succeeded in 1 usecs
[   83.997403] [drm] UVD initialized successfully.
[   83.997450] [drm] ib test on ring 0 succeeded in 0 usecs
[   83.997494] [drm] ib test on ring 3 succeeded in 1 usecs
[   94.137259] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
[   94.137263] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
[   94.137265] [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
[   94.137268] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
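
To check for this failure after each resume without reading the whole log, a
small filter along these lines could be used. It is only a sketch: the
function name radeon_resume_errors is hypothetical, and the patterns are
taken from the messages quoted above, so other kernel versions may word them
differently.

```shell
# Hypothetical helper: read kernel log text on stdin and print only the
# radeon ring stall / fence wait / IB test failures.
# Exit status is 0 if at least one such line was found, 1 otherwise.
radeon_resume_errors() {
    grep -E 'ring [0-9]+ stalled for more than|fence wait failed \(-?[0-9]+\)|failed testing IB on'
}

# Example use after resume:
#   dmesg | radeon_resume_errors && echo "UVD ring failed to come back"
```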

At this point, if an Xorg server is started, any attempt at VDPAU hardware
decoding fails.
If the computer is left running without a reboot, a timeout eventually triggers
and a GPU reset may be attempted, which usually hangs the system. (The
following log is from an earlier GPU reset that probably succeeded.)

[    0.000000] Linux version 3.15.2 (root) (gcc version 4.8.3 (GCC) ) #2 SMP
[12398.387691] radeon 0000:01:00.0: ring 5 stalled for more than 242796msec
[12398.387699] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000005 last fence id 0x0000000000000004 on ring 5)
[12398.425151] radeon 0000:01:00.0: Saved 23 dwords of commands on ring 0.
[12398.425167] radeon 0000:01:00.0: GPU softreset: 0x00000009
[12398.425169] radeon 0000:01:00.0:   GRBM_STATUS               = 0xF5703828
[12398.425171] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0xFC000007
[12398.425173] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[12398.425175] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200800C0
[12398.425177] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[12398.425179] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[12398.425181] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x40000000
[12398.425183] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008004
[12398.425185] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x80228647
[12398.425187] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[12398.441572] radeon 0000:01:00.0: GRBM_SOFT_RESET=0x00007F6B
[12398.441626] radeon 0000:01:00.0: SRBM_SOFT_RESET=0x00000100
[12398.442782] radeon 0000:01:00.0:   GRBM_STATUS               = 0x00003828
[12398.442783] radeon 0000:01:00.0:   GRBM_STATUS_SE0           = 0x00000007
[12398.442785] radeon 0000:01:00.0:   GRBM_STATUS_SE1           = 0x00000007
[12398.442787] radeon 0000:01:00.0:   SRBM_STATUS               = 0x200800C0
[12398.442789] radeon 0000:01:00.0:   SRBM_STATUS2              = 0x00000000
[12398.442791] radeon 0000:01:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[12398.442793] radeon 0000:01:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[12398.442795] radeon 0000:01:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[12398.442796] radeon 0000:01:00.0:   R_008680_CP_STAT          = 0x00000000
[12398.442798] radeon 0000:01:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[12398.442813] radeon 0000:01:00.0: GPU reset succeeded, trying to resume
[12398.512161] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[12398.516583] [drm] PCIE GART of 1024M enabled (table at 0x000000000025D000).

The bug does not occur if only suspend to RAM or only suspend to disk is used.

I tried to bisect, but a number of rc kernels hang for me long before the
radeon module is loaded. At this point a bisect does not look feasible.

I suspected that the new async suspend/resume code might be at fault, as I was
seeing the video card being turned off (i.e. the monitor going off) and then
turning back on for a moment before shutting down completely.

So I looked up the async suspend commits (from an article) and reverted them.
Reverting 200421a80f6e0a9e39d698944cc35cba103eb6ce and
3c31b52f96f7b559d950b16113c0f68c72a1985e seems to avoid the effect described
above, where the monitor turns off, on, then off again. But it does not fix the
bug.

Reverting
7cd0602d7836c0056fe9bdab014d5ac5ec5cb291,
92858c476ec4e99cf0425f05dee109b6a55eb6f8,
9e5e7910df824ba02aedd2b5d2ca556426ea6d0b,
76569faa62c46382e080c3e190c66e19515aae1c,
de377b3972729f00ee236ae4a97393e282ffe391,
28b6fd6e37792b16a56d324841bdb20ab78e4522 and
a59ffb2062df3a5c346dbed931fa1e587fd0f0f3
does not affect the bug either, so I assume that this bug is not related to
the async suspend/resume changes.
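
For anyone who wants to repeat these reverts, the following sketch prints one
`git revert` command per commit instead of running it, so the list can be
reviewed first. The hashes are the ones listed in this report; the order in
which they are reverted, the variable ASYNC_COMMITS and the function
print_reverts are assumptions made for illustration, and some reverts may
need manual conflict resolution.

```shell
# Commit hashes copied from this report; the order is an assumption.
ASYNC_COMMITS='200421a80f6e0a9e39d698944cc35cba103eb6ce
3c31b52f96f7b559d950b16113c0f68c72a1985e
7cd0602d7836c0056fe9bdab014d5ac5ec5cb291
92858c476ec4e99cf0425f05dee109b6a55eb6f8
9e5e7910df824ba02aedd2b5d2ca556426ea6d0b
76569faa62c46382e080c3e190c66e19515aae1c
de377b3972729f00ee236ae4a97393e282ffe391
28b6fd6e37792b16a56d324841bdb20ab78e4522
a59ffb2062df3a5c346dbed931fa1e587fd0f0f3'

# Hypothetical helper: read hashes on stdin (one per line) and print the
# matching "git revert" command for each. Nothing is executed here.
print_reverts() {
    while read -r hash; do
        if [ -n "$hash" ]; then
            printf 'git revert --no-edit %s\n' "$hash"
        fi
    done
}

# Example use, from a scratch branch inside a 3.15.x kernel tree:
#   printf '%s\n' "$ASYNC_COMMITS" | print_reverts
```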

If you cannot reproduce the problem, please advise me which commits to revert.
