[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

Thu Dec 8 03:35:47 PST 2011

Thank you for your reply.

I found that CP_RB_WPTR has changed when "ring test failed" occurs, so I
think the CP is active, but what it gets from the ring buffer is wrong. So
I want to know whether there is a way to check the content that the GPU
fetches from the ring buffer.
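
As background for the "ring test failed" message: to my understanding of
r600_ring_test() in the radeon driver, the scratch register is first seeded
with 0xCAFEDEAD and the CP is then asked, through the ring, to overwrite it
with 0xDEADBEEF. Below is a minimal userspace sketch of that interpretation,
not the driver code. The enum and classify_scratch() are hypothetical names
introduced only for illustration.

```c
#include <stdint.h>

#define RING_TEST_SEED   0xCAFEDEADu  /* written by the CPU before the test */
#define RING_TEST_MAGIC  0xDEADBEEFu  /* value the CP is asked to write */

/* Hypothetical classification of the scratch register after the test. */
enum ring_test_result {
    RING_TEST_PASSED,       /* the CP executed the write packet */
    RING_TEST_NOT_WRITTEN,  /* scratch still holds the seed value */
    RING_TEST_GARBAGE       /* scratch holds some unexpected value */
};

static enum ring_test_result classify_scratch(uint32_t scratch)
{
    if (scratch == RING_TEST_MAGIC)
        return RING_TEST_PASSED;
    if (scratch == RING_TEST_SEED)
        return RING_TEST_NOT_WRITTEN;
    return RING_TEST_GARBAGE;
}
```

For the value in the log quoted below, scratch(0x8504)=0xCAFEDEAD,
classify_scratch() returns RING_TEST_NOT_WRITTEN, i.e. the packet that
should have written the scratch register was never executed. That is at
least consistent with the CP fetching wrong data from the ring.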

BTW, when I use "echo shutdown > /sys/power/disk; echo disk >
/sys/power/state" to hibernate, there is occasionally a "GPU reset", just
as with suspend. However, if I use "echo reboot > /sys/power/disk; echo
disk > /sys/power/state" to hibernate and wake up automatically, there is
no "GPU reset" after hundreds of tests. What does this imply? Does the
power loss cause something to break?

Best regards,

Huacai Chen

> 2011/12/7  <chenhc at lemote.com>:
>> When "MC timeout" happens at GPU reset, we found that the 12th and 13th
>> bits of R_000E50_SRBM_STATUS are 1. From the kernel code we found that
>> these two bits are defined like this:
>> #define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
>> #define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)
>>
>> Could you please tell me what they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know offhand what the names mean.
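
To make the bit positions concrete, here is a minimal sketch in plain C
(userspace, not driver code) that applies the two G_000E50_* macros quoted
above to raw SRBM_STATUS values such as those in the log further down in
this thread. The helpers mc_busy_mask() and print_mc_busy() are hypothetical
names introduced only for this illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Field extractors as quoted from the kernel headers in this thread. */
#define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)

/* Hypothetical helper: pack the two busy bits into a small mask,
 * bit 0 = MCDX_BUSY, bit 1 = MCDW_BUSY. */
static unsigned mc_busy_mask(uint32_t srbm_status)
{
    return (unsigned)(G_000E50_MCDX_BUSY(srbm_status) |
                      (G_000E50_MCDW_BUSY(srbm_status) << 1));
}

/* Hypothetical helper: print a one-line summary for a raw value. */
static void print_mc_busy(const char *label, uint32_t srbm_status)
{
    printf("%s: SRBM_STATUS=0x%08X MCDX_BUSY=%u MCDW_BUSY=%u\n",
           label, (unsigned)srbm_status,
           (unsigned)G_000E50_MCDX_BUSY(srbm_status),
           (unsigned)G_000E50_MCDW_BUSY(srbm_status));
}
```

Calling print_mc_busy() with the two values from the log, 0x20023040
(before the soft reset) and 0x2002B040 (after it), reports both bits set in
each case, so the soft reset did not clear either of them.
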
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET                       0x0E60
>> #define R_000E50_SRBM_STATUS                           0x0E50
>> #define R_008020_GRBM_SOFT_RESET                0x8020
>> #define R_008010_GRBM_STATUS                    0x8010
>> #define R_008014_GRBM_STATUS2                   0x8014
>>
>> A bit more info: If I reset the MC after resetting the CP (this is what
>> Linux-2.6.34 does, but it has been removed since 2.6.35), then "MC
>> timeout" will disappear, but there is still "ring test failed".
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it prior to calling
> asic_init so you make sure all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  <chenhc at lemote.com>:
>>>> And, I want to know something:
>>>> 1, Does GPU use MC to access GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
>>>> 2, What can cause MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> Some status update.
>>>>> On September 29, 2011 at 5:17 PM, Chen Jie <chenj at lemote.com> wrote:
>>>>>> Hi,
>>>>>> Add more information.
>>>>>> We occasionally get "GPU lockup" after resuming from suspend (on a
>>>>>> mipsel platform with a mips64 compatible CPU and rs780e, the kernel
>>>>>> is 3.1.0-rc8 64bit).  Related kernel message:
>>>>>> /* return from STR */
>>>>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>>>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>>>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>>>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>>> [  156.597656] ata1.00: configured for UDMA/133
>>>>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
>>>>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
>>>>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
>>>>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>>>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>>>>> [  165.628906] Restarting tasks ... done.
>>>>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than 10019msec
>>>>>> [  177.089843] ------------[ cut here ]------------
>>>>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x25c/0x33c()
>>>>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
>>>>>> [  177.113281] Modules linked in: psmouse serio_raw
>>>>>> [  177.117187] Call Trace:
>>>>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>>>> [  177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
>>>>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>>>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>>>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>>>>> [  177.191406] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xF57C2030
>>>>>> [  177.195312] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00111103
>>>>>> [  177.203125] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x20023040
>>>>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.367187] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>>> [  177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>>>>> [  177.414062] radeon 0000:01:05.0: R_008010_GRBM_STATUS=0xA0003030
>>>>>> [  177.417968] radeon 0000:01:05.0: R_008014_GRBM_STATUS2=0x00000003
>>>>>> [  177.425781] radeon 0000:01:05.0: R_000E50_SRBM_STATUS=0x2002B040
>>>>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>>>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xCAFEDEAD)
>>>>> After pinning the ring in VRAM, it reported an ib test failure. It
>>>>> seems something is wrong with accesses through the GTT.
>>>>>
>>>>> We dumped the GART table just after stopping the CP and compared it
>>>>> with the one dumped just after r600_pcie_gart_enable, and did not
>>>>> find any difference.
>>>>>
>>>>> Any idea?
>>>>>
>>>>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>>>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(5).
>>>>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(6).
>>>>>> ...
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> -- Chen Jie
>>>>>