[mipsel+rs780e] Occasionally "GPU lockup" after resuming from suspend.
Alex Deucher
alexdeucher at gmail.com
Wed Dec 7 06:21:15 PST 2011
2011/12/7 <chenhc at lemote.com>:
> When "MC timeout" happens at GPU reset, we found the 12th and 13th
> bits of R_000E50_SRBM_STATUS is 1. From kernel code we found these
> two bits are like this:
> #define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1)
> #define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)
>
> Could you please tell me what does they mean? And if possible,
They refer to sub-blocks in the memory controller. I don't really
know offhand what the names mean.
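(For reference, a minimal, self-contained sketch of how those r600d.h
helpers decode a dumped status word; the value 0x20023040 is the
R_000E50_SRBM_STATUS reading from the lockup log quoted further down,
and it does show both MC sub-blocks busy.)

#include <stdio.h>
#include <stdint.h>

/* same shift-and-mask helpers as quoted above from r600d.h */
#define G_000E50_MCDX_BUSY(x) (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x) (((x) >> 13) & 1)

int main(void)
{
    /* R_000E50_SRBM_STATUS as printed by the softreset path in the log */
    uint32_t srbm_status = 0x20023040;

    printf("MCDX busy: %u\n", (unsigned)G_000E50_MCDX_BUSY(srbm_status));
    printf("MCDW busy: %u\n", (unsigned)G_000E50_MCDW_BUSY(srbm_status));
    return 0;
}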
> I want to know the functionality of these 5 registers in detail:
> #define R_000E60_SRBM_SOFT_RESET 0x0E60
> #define R_000E50_SRBM_STATUS 0x0E50
> #define R_008020_GRBM_SOFT_RESET 0x8020
> #define R_008010_GRBM_STATUS 0x8010
> #define R_008014_GRBM_STATUS2 0x8014
>
> A bit more info: if I reset the MC after resetting the CP (this is what
> Linux 2.6.34 does, but it was removed in 2.6.35), then the "MC timeout"
> disappears, but there is still a "ring test failed".
The bits are defined in r600d.h. As to the acronyms:
BIF - Bus InterFace
CG - clocks
DC - Display Controller
GRBM - Graphics block (3D engine)
HDP - Host Data Path (CPU access to vram via the PCI BAR)
IH, RLC - Interrupt controller
MC - Memory controller
ROM - ROM
SEM - semaphore controller
When you reset the MC, you will probably have to reset just about
everything else, since most blocks depend on the MC for access to
memory. If you do reset the MC, you should do it prior to calling
asic_init to make sure all the hw gets re-initialized properly.
Additionally, you should probably reset the GRBM, either via
SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
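A rough, untested sketch of that ordering (the S_000E60_* reset
helpers are assumed to be the ones defined in r600d.h; the exact set
of bits to assert may need adjusting for rs780e):

/* Assert MC + GRBM soft reset through SRBM, release it, then let the
 * normal asic_init/startup path re-program everything that depends on
 * the MC (MC registers, GART, CP, ...). */
static void rs780_mc_soft_reset_sketch(struct radeon_device *rdev)
{
    u32 tmp = S_000E60_SOFT_RESET_MC(1) | S_000E60_SOFT_RESET_GRBM(1);

    WREG32(R_000E60_SRBM_SOFT_RESET, tmp);
    RREG32(R_000E60_SRBM_SOFT_RESET);    /* post the write */
    mdelay(1);

    WREG32(R_000E60_SRBM_SOFT_RESET, 0); /* release reset */
    RREG32(R_000E60_SRBM_SOFT_RESET);
    mdelay(1);

    /* the caller is then expected to go through asic_init so the MC,
     * GART, and CP are fully re-programmed */
}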
Alex
>
> Huacai Chen
>
>> 2011/11/8 <chenhc at lemote.com>:
>>> Also, I want to know a couple of things:
>>> 1. Does the GPU use the MC to access GTT?
>>
>> Yes. All GPU clients (display, 3D, etc.) go through the MC to access
>> memory (vram or gart).
>>
>>> 2. What can cause an MC timeout?
>>
>> Lots of things: some GPU client still active, or some GPU client hung or
>> not properly initialized.
>>
>> Alex
>>
>>>
>>>> Hi,
>>>>
>>>> Some status update.
>>>> On 2011-09-29 at 5:17 PM, Chen Jie <chenj at lemote.com> wrote:
>>>>> Hi,
>>>>> Add more information.
>>>>> We occasionally get a "GPU lockup" after resuming from suspend (on a
>>>>> mipsel platform with a mips64-compatible CPU and an rs780e; the kernel
>>>>> is 64-bit 3.1.0-rc8). Related kernel messages:
>>>>> /* return from STR */
>>>>> [ 156.152343] radeon 0000:01:05.0: WB enabled
>>>>> [ 156.187500] [drm] ring test succeeded in 0 usecs
>>>>> [ 156.187500] [drm] ib test succeeded in 0 usecs
>>>>> [ 156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>>> [ 156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>>> [ 156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>>> [ 156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>> [ 156.597656] ata1.00: configured for UDMA/133
>>>>> [ 156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
>>>>> [ 157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
>>>>> [ 157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
>>>>> [ 157.683593] r8169 0000:02:00.0: eth0: link up
>>>>> [ 165.621093] PM: resume of devices complete after 9679.556 msecs
>>>>> [ 165.628906] Restarting tasks ... done.
>>>>> [ 177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than 10019msec
>>>>> [ 177.089843] ------------[ cut here ]------------
>>>>> [ 177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x25c/0x33c()
>>>>> [ 177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
>>>>> [ 177.113281] Modules linked in: psmouse serio_raw
>>>>> [ 177.117187] Call Trace:
>>>>> [ 177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>>> [ 177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>>> [ 177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>>> [ 177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>>> [ 177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>>> [ 177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
>>>>> [ 177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>>> [ 177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>>>> [ 177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>>> [ 177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>>> [ 177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>>> [ 177.187500] radeon 0000:01:05.0: GPU softreset
>>>>> [ 177.191406] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xF57C2030
>>>>> [ 177.195312] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00111103
>>>>> [ 177.203125] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x20023040
>>>>> [ 177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>> [ 177.367187] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>> [ 177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>>>> [ 177.414062] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA0003030
>>>>> [ 177.417968] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
>>>>> [ 177.425781] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x2002B040
>>>>> [ 177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>>> [ 177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>> [ 177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>> [ 177.804687] radeon 0000:01:05.0: WB enabled
>>>>> [ 178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xCAFEDEAD)
>>>> After pinning the ring in VRAM, it warned of an IB test failure. It seems
>>>> something is wrong with accessing memory through GTT.
>>>>
>>>> We dumped the GART table just after stopping the CP and compared it with
>>>> the one dumped just after r600_pcie_gart_enable, and didn't find any
>>>> difference.
>>>>
>>>> Any idea?
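(A hypothetical sketch of such a dump helper, kept independent of the
exact radeon_gart field names of that kernel version; the caller
passes in the CPU mapping of the GART page table and the number of
entries, and two dumps taken at the two points mentioned above can
then be diffed.)

/* Print every GART page-table entry so two dumps can be compared. */
static void dump_gart_table(volatile u32 __iomem *tbl, unsigned entries)
{
    unsigned i;

    for (i = 0; i < entries; i++)
        printk(KERN_INFO "gart[%u] = 0x%08x\n", i, readl(tbl + i));
}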
>>>>
>>>>> [ 178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>>>> [ 178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(5).
>>>>> [ 178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>>> [ 179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(6).
>>>>> ...
>>>>
>>>>
>>>>
>>>> Regards,
>>>> -- Chen Jie