[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

Wed Feb 15 01:32:35 PST 2012

Hi,

Status update about the problem 'Occasionally "GPU lockup" after
resuming from suspend.'

First, this can happen when the system returns from either STR (suspend
to RAM) or STD (suspend to disk, a.k.a. hibernation).
When returning from STD, the initialization process is very similar to
a normal boot.
Standby is OK. It is similar to STR, except that standby does not cut
power to the CPU, GPU, etc.

We've dumped and compared the registers, and found something:
CP_STAT
normal value: 0x00000000
value when this problem occurred: 0x802100C1 or 0x802300C1

CP_ME_CNTL
normal value: 0x000000FF
value when this problem occurred: always 0x200000FF in our test

Questions:
According to the manual,
CP_STAT = 0x802100C1 means
	CSF_RING_BUSY(bit 0):
		The Ring fetcher still has command buffer data to fetch, or the
		PFP still has data left to process from the reorder queue.
	CSF_BUSY(bit 6):
		The input FIFOs have command buffers to fetch, or one or more of
		the fetchers are busy, or the arbiter has a request to send to
		the MIU.
	MIU_RDREQ_BUSY(bit 7):
		The read path logic inside the MIU is busy.
	MEQ_BUSY(bit 16):
		The PFP-to-ME queue has valid data in it.
	SURFACE_SYNC_BUSY(bit 21):
		The Surface Sync unit is busy.
	CP_BUSY(bit 31):
		Any block in the CP is busy.
What does it suggest?

What does it mean if bit 29 of CP_ME_CNTL is set?
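
For reference, the decoding above can be reproduced mechanically. The
following is a minimal sketch in C that assumes only the bit positions
listed in this mail; the helper names (cp_stat_dump, cp_stat_unnamed,
cp_me_cntl_bit29) are made up for illustration and are not part of the
radeon driver. It also shows that 0x802300C1 differs from 0x802100C1 by
bit 17, which is not among the bits described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bit positions and names exactly as listed in this mail. */
struct cp_stat_bit {
    unsigned bit;
    const char *name;
};

static const struct cp_stat_bit cp_stat_bits[] = {
    { 0,  "CSF_RING_BUSY" },
    { 6,  "CSF_BUSY" },
    { 7,  "MIU_RDREQ_BUSY" },
    { 16, "MEQ_BUSY" },
    { 21, "SURFACE_SYNC_BUSY" },
    { 31, "CP_BUSY" },
};

#define CP_STAT_NBITS (sizeof(cp_stat_bits) / sizeof(cp_stat_bits[0]))

/* Return the bits of `value` that the table above does not name. */
static uint32_t cp_stat_unnamed(uint32_t value)
{
    uint32_t known = 0;

    for (size_t i = 0; i < CP_STAT_NBITS; i++)
        known |= UINT32_C(1) << cp_stat_bits[i].bit;
    return value & ~known;
}

/*
 * Print every named bit that is set in `value`, followed by any
 * leftover bits. Returns the number of named bits that were set.
 */
static int cp_stat_dump(FILE *out, uint32_t value)
{
    int count = 0;

    assert(out != NULL);
    fprintf(out, "CP_STAT = 0x%08X\n", (unsigned)value);
    for (size_t i = 0; i < CP_STAT_NBITS; i++) {
        if ((value >> cp_stat_bits[i].bit) & 1) {
            fprintf(out, "  bit %2u: %s\n",
                    cp_stat_bits[i].bit, cp_stat_bits[i].name);
            count++;
        }
    }
    if (cp_stat_unnamed(value) != 0)
        fprintf(out, "  unnamed bits: 0x%08X\n",
                (unsigned)cp_stat_unnamed(value));
    return count;
}

/* Bit 29 of CP_ME_CNTL. Its meaning is the open question, so only test it. */
static int cp_me_cntl_bit29(uint32_t value)
{
    return (int)((value >> 29) & 1);
}
```

Calling cp_stat_dump(stdout, 0x802100C1) lists the six bits named above,
cp_stat_unnamed(0x802300C1) returns 0x00020000 (bit 17), and
cp_me_cntl_bit29() is 1 for 0x200000FF and 0 for the normal 0x000000FF.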

BTW, how does the dummy page work in GART?

Regards,
-- Chen Jie

On December 7, 2011 at 10:21 PM, Alex Deucher <alexdeucher at gmail.com> wrote:
> 2011/12/7  <chenhc at lemote.com>:
>> When "MC timeout" happens at GPU reset, we found that bits 12 and 13
>> of R_000E50_SRBM_STATUS are 1. In the kernel code these two bits are
>> defined like this:
>> #define         G_000E50_MCDX_BUSY(x)              (((x) >> 12) & 1)
>> #define         G_000E50_MCDW_BUSY(x)              (((x) >> 13) & 1)
>>
>> Could you please tell me what they mean? And if possible,
>
> They refer to sub-blocks in the memory controller.  I don't really
> know offhand what the names mean.
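
The two macros can be applied directly to the SRBM_STATUS values in the
softreset log quoted further down in this thread (0x20023040 before the
reset, 0x2002B040 after). A minimal sketch in C; srbm_mcd_busy and
srbm_changed are hypothetical helper names, not kernel functions:

```c
#include <stdint.h>

/* Copied from the quoted mail (kernel r600d.h). */
#define G_000E50_MCDX_BUSY(x)  (((x) >> 12) & 1)
#define G_000E50_MCDW_BUSY(x)  (((x) >> 13) & 1)

/*
 * Hypothetical helper: pack the two busy flags into a small mask,
 * bit 0 = MCDX_BUSY, bit 1 = MCDW_BUSY.
 */
static unsigned srbm_mcd_busy(uint32_t srbm_status)
{
    return (unsigned)(G_000E50_MCDX_BUSY(srbm_status) |
                      (G_000E50_MCDW_BUSY(srbm_status) << 1));
}

/* Hypothetical helper: bits that differ between two register dumps. */
static uint32_t srbm_changed(uint32_t before, uint32_t after)
{
    return before ^ after;
}
```

Both logged values give 3 (MCDX and MCDW busy), so the reset did not
clear either flag. The only bit that changes across the reset is bit 15
(0x00008000), whose meaning is not given in this thread.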
>
>> I want to know the functionalities of these 5 registers in detail:
>> #define R_000E60_SRBM_SOFT_RESET                       0x0E60
>> #define R_000E50_SRBM_STATUS                           0x0E50
>> #define R_008020_GRBM_SOFT_RESET                0x8020
>> #define R_008010_GRBM_STATUS                    0x8010
>> #define R_008014_GRBM_STATUS2                   0x8014
>>
>> A bit more info: If I reset the MC after resetting the CP (this is
>> what Linux 2.6.34 does, but it has been removed since 2.6.35), then
>> "MC timeout" disappears, but there is still "ring test failed".
>
> The bits are defined in r600d.h.  As to the acronyms:
> BIF - Bus InterFace
> CG - clocks
> DC - Display Controller
> GRBM - Graphics block (3D engine)
> HDP - Host Data Path (CPU access to vram via the PCI BAR)
> IH, RLC - Interrupt controller
> MC - Memory controller
> ROM - ROM
> SEM - semaphore controller
>
> When you reset the MC, you will probably have to reset just about
> everything else since most blocks depend on the MC for access to
> memory.  If you do reset the MC, you should do it prior to calling
> asic_init so you make sure all the hw gets re-initialized properly.
> Additionally, you should probably reset the GRBM either via
> SRBM_SOFT_RESET or the individual sub-blocks via GRBM_SOFT_RESET.
>
> Alex
>
>>
>> Huacai Chen
>>
>>> 2011/11/8  <chenhc at lemote.com>:
>>>> Also, I would like to know a few things:
>>>> 1. Does the GPU use the MC to access the GTT?
>>>
>>> Yes.  All GPU clients (display, 3D, etc.) go through the MC to access
>>> memory (vram or gart).
>>>
>>>> 2. What can cause an MC timeout?
>>>
>>> Lots of things.  Some GPU client still active, some GPU client hung or
>>> not properly initialized.
>>>
>>> Alex
>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> Some status update.
>>>>> On September 29, 2011 at 5:17 PM, Chen Jie <chenj at lemote.com> wrote:
>>>>>> Hi,
>>>>>> Adding more information.
>>>>>> We occasionally get a "GPU lockup" after resuming from suspend (on a
>>>>>> mipsel platform with a MIPS64-compatible CPU and rs780e; the kernel
>>>>>> is 3.1.0-rc8, 64-bit).  Related kernel messages:
>>>>>> /* return from STR */
>>>>>> [  156.152343] radeon 0000:01:05.0: WB enabled
>>>>>> [  156.187500] [drm] ring test succeeded in 0 usecs
>>>>>> [  156.187500] [drm] ib test succeeded in 0 usecs
>>>>>> [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
>>>>>> [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>>>>>> [  156.597656] ata1.00: configured for UDMA/133
>>>>>> [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
>>>>>> [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
>>>>>> [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
>>>>>> [  157.683593] r8169 0000:02:00.0: eth0: link up
>>>>>> [  165.621093] PM: resume of devices complete after 9679.556 msecs
>>>>>> [  165.628906] Restarting tasks ... done.
>>>>>> [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than 10019msec
>>>>>> [  177.089843] ------------[ cut here ]------------
>>>>>> [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267 radeon_fence_wait+0x25c/0x33c()
>>>>>> [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
>>>>>> [  177.113281] Modules linked in: psmouse serio_raw
>>>>>> [  177.117187] Call Trace:
>>>>>> [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
>>>>>> [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
>>>>>> [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
>>>>>> [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
>>>>>> [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
>>>>>> [  177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
>>>>>> [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
>>>>>> [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
>>>>>> [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
>>>>>> [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
>>>>>> [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
>>>>>> [  177.187500] radeon 0000:01:05.0: GPU softreset
>>>>>> [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
>>>>>> [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
>>>>>> [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
>>>>>> [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.367187] radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>>>> [  177.390625] radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00000001
>>>>>> [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
>>>>>> [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
>>>>>> [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
>>>>>> [  177.433593] radeon 0000:01:05.0: GPU reset succeed
>>>>>> [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>>>>> [  177.804687] radeon 0000:01:05.0: WB enabled
>>>>>> [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed (scratch(0x8504)=0xCAFEDEAD)
>>>>> After pinning the ring in VRAM, it reported an IB test failure. It
>>>>> seems something is wrong with accesses that go through the GTT.
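
On the "ring test failed (scratch(0x8504)=0xCAFEDEAD)" line in the log
above: the sketch below is a toy model of a scratch-register ring test,
not the driver code. It assumes the CPU seeds the scratch register with
0xCAFEDEAD and then queues a packet asking the CP to overwrite it with
0xDEADBEEF; the second value does not appear in this thread and should
be checked against r600_ring_test in the kernel source. struct fake_gpu
and both functions are hypothetical stand-ins for the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCRATCH_SEED   0xCAFEDEADu /* value reported in the failure log */
#define SCRATCH_EXPECT 0xDEADBEEFu /* assumption: value the CP should write */

/* Hypothetical stand-in for the hardware state relevant to the test. */
struct fake_gpu {
    uint32_t scratch;    /* the scratch register */
    bool     cp_running; /* does the CP execute packets from the ring? */
};

/* Model of the CP consuming the "write SCRATCH_EXPECT" packet. */
static void fake_cp_process(struct fake_gpu *gpu)
{
    if (gpu->cp_running)
        gpu->scratch = SCRATCH_EXPECT;
}

/*
 * Returns the poll iteration at which the expected value was seen,
 * or -1 if the register still holds something else after max_polls.
 */
static int fake_ring_test(struct fake_gpu *gpu, int max_polls)
{
    gpu->scratch = SCRATCH_SEED; /* CPU seeds the register directly */
    fake_cp_process(gpu);        /* packet queued; the CP may or may not run it */

    for (int i = 0; i < max_polls; i++) {
        if (gpu->scratch == SCRATCH_EXPECT)
            return i;
    }
    return -1;
}
```

Under these assumptions, a scratch register that still reads 0xCAFEDEAD
after the timeout means the direct register write from the CPU worked,
but the packet placed on the ring was never executed by the CP.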
>>>>>
>>>>> We dumped the GART table just after stopping the CP and compared it
>>>>> with the table dumped just after r600_pcie_gart_enable, and did not
>>>>> find any difference.
>>>>>
>>>>> Any idea?
>>>>>
>>>>>> [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
>>>>>> [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(5).
>>>>>> [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
>>>>>> [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule IB(6).
>>>>>> ...