[mipsel+rs780e] Occasional "GPU lockup" after resuming from suspend.

chenhc at lemote.com
Wed Dec 7 03:48:08 PST 2011


When "MC timeout" happens at GPU reset, we found the 12th and 13th
bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
two bits are like this:
#define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
#define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)
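
For reference, this is roughly how we check them after the reset attempt (a
minimal sketch using the driver's usual RREG32 accessor; the interpretation
in the comment is only my guess):

	u32 srbm_status = RREG32(R_000E50_SRBM_STATUS);

	/* guess: MCDX/MCDW are per-channel memory-controller "data busy" flags */
	if (G_000E50_MCDX_BUSY(srbm_status) || G_000E50_MCDW_BUSY(srbm_status))
		dev_warn(rdev->dev, "MC still busy, SRBM_STATUS=0x%08X\n",
			 srbm_status);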

Could you please tell me what they mean? And if possible, I would like to
know the functionality of these five registers in detail:
#define R_000E60_SRBM_SOFT_RESET                       0x0E60
#define R_000E50_SRBM_STATUS                           0x0E50
#define R_008020_GRBM_SOFT_RESET                0x8020
#define R_008010_GRBM_STATUS                    0x8010
#define R_008014_GRBM_STATUS2                   0x8014

A bit more info: if I reset the MC after resetting the CP (this is what
Linux 2.6.34 did, but it was removed in 2.6.35), then the "MC timeout"
disappears, but the "ring test failed" error remains (see the sketch below).
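
The ordering I experimented with looks roughly like this (a simplified sketch
of the old 2.6.34 r600_gpu_soft_reset path, reconstructed from memory, so the
exact masks and macro names may differ from r600d.h):

	/* reset the CP through GRBM first ... */
	WREG32(R_008020_GRBM_SOFT_RESET, S_008020_SOFT_RESET_CP(1));
	RREG32(R_008020_GRBM_SOFT_RESET);
	mdelay(15);
	WREG32(R_008020_GRBM_SOFT_RESET, 0);

	/* ... then soft-reset the MC through SRBM if its busy bits are still set */
	if (G_000E50_MCDX_BUSY(RREG32(R_000E50_SRBM_STATUS)) ||
	    G_000E50_MCDW_BUSY(RREG32(R_000E50_SRBM_STATUS))) {
		WREG32(R_000E60_SRBM_SOFT_RESET, S_000E60_SOFT_RESET_MC(1));
		RREG32(R_000E60_SRBM_SOFT_RESET);
		mdelay(15);
		WREG32(R_000E60_SRBM_SOFT_RESET, 0);
	}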

Huacai Chen

> On 2011/11/8, <chenhc at lemote.com> wrote:
>> And I want to know a couple of things:
>> 1. Does the GPU use the MC to access the GTT?
>
> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
> memory (vram or gart).
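>
> For example, r600_pcie_gart_enable points the MC's VM block at the gart
> aperture and page table, roughly like this (a sketch from memory, the real
> function programs a longer register list):
>
>     WREG32(VM_CONTEXT0_PAGE_TABLE_START_ADDR, rdev->mc.gtt_start >> 12);
>     WREG32(VM_CONTEXT0_PAGE_TABLE_END_ADDR, rdev->mc.gtt_end >> 12);
>     WREG32(VM_CONTEXT0_PAGE_TABLE_BASE_ADDR, rdev->gart.table_addr >> 12);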
>
>> 2. What can cause an MC timeout?
>
> Lots of things.  Some GPU client still active, some GPU client hung or
> not properly initialized.
>
> Alex
>
>>
>>> Hi,
>>>
>>> Some status update.
>>> On September 29, 2011 at 5:17 PM, Chen Jie <chenj at lemote.com> wrote:
>>>> Hi,
>>>> Add more information.
>>>> We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
>>>> platform with a mips64-compatible CPU and an rs780e; the kernel is 64-bit
>>>> 3.1.0-rc8). Related kernel messages:
>>>> /* return from STR */
>>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>> [  156.597656] ata1.00: configured for UDMA/133
>>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
>>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
>>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
>>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>>> [  165.628906] Restarting tasks ... done.
>>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than 10019msec
>>>> [  177.089843] ------------[ cut here ]------------
>>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x25c/0x33c()
>>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
>>>> [  177.113281] Modules linked in: psmouse serio_raw
>>>> [  177.117187] Call Trace:
>>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>> [  177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
>>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>>> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>>>> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
>>>> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>> [  177.367187] radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>> [  177.390625] radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00000001
>>>> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>>>> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
>>>> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xCAFEDEAD)
>>> After pinning the ring in VRAM, it warned about an IB test failure. It seems
>>> something is wrong with accesses through the GTT.
>>>
>>> We dumped the GART table just after stopping the CP and compared it with the
>>> one dumped just after r600_pcie_gart_enable, and found no difference; the
>>> dump was done roughly as in the sketch below.
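>>>
>>> The dump itself was just a linear snapshot of the table entries, roughly
>>> like this (assuming the table is CPU-mapped at rdev->gart.table.vram.ptr
>>> as in this kernel, one 32-bit entry per GPU page; buf is a scratch buffer
>>> we allocated for the comparison):
>>>
>>>     volatile uint32_t *tbl = rdev->gart.table.vram.ptr;
>>>     unsigned i;
>>>
>>>     for (i = 0; i < rdev->gart.num_gpu_pages; i++)
>>>         buf[i] = tbl[i];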
>>>
>>> Any idea?
>>>
>>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(5).
>>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(6).
>>>> ...
>>>
>>>
>>>
>>> Regards,
>>> -- Chen Jie



