[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

Tue Nov 8 07:14:27 PST 2011

On Tue, Nov 08, 2011 at 03:33:03PM +0800, Chen Jie wrote:
> Hi,
> 
> Some status update.
> On September 29, 2011 at 5:17 PM, Chen Jie <chenj at lemote.com> wrote:
> > Hi,
> > Adding more information.
> > We occasionally get a "GPU lockup" after resuming from suspend (on a mipsel
> > platform with a mips64-compatible CPU and rs780e; the kernel is 3.1.0-rc8,
> > 64-bit).  Related kernel messages:
> > /* return from STR */
> > [  156.152343] radeon 0000:01:05.0: WB enabled
> > [  156.187500] [drm] ring test succeeded in 0 usecs
> > [  156.187500] [drm] ib test succeeded in 0 usecs
> > [  156.398437] ata2: SATA link down (SStatus 0 SControl 300)
> > [  156.398437] ata3: SATA link down (SStatus 0 SControl 300)
> > [  156.398437] ata4: SATA link down (SStatus 0 SControl 300)
> > [  156.578125] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [  156.597656] ata1.00: configured for UDMA/133
> > [  156.613281] usb 1-5: reset high speed USB device number 4 using ehci_hcd
> > [  157.027343] usb 3-2: reset low speed USB device number 2 using ohci_hcd
> > [  157.609375] usb 3-3: reset low speed USB device number 3 using ohci_hcd
> > [  157.683593] r8169 0000:02:00.0: eth0: link up
> > [  165.621093] PM: resume of devices complete after 9679.556 msecs
> > [  165.628906] Restarting tasks ... done.
> > [  177.085937] radeon 0000:01:05.0: GPU lockup CP stall for more than
> > 10019msec
> > [  177.089843] ------------[ cut here ]------------
> > [  177.097656] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:267
> > radeon_fence_wait+0x25c/0x33c()
> > [  177.105468] GPU lockup (waiting for 0x000013C3 last fence id 0x000013AD)
> > [  177.113281] Modules linked in: psmouse serio_raw
> > [  177.117187] Call Trace:
> > [  177.121093] [<ffffffff806f3e7c>] dump_stack+0x8/0x34
> > [  177.125000] [<ffffffff8022e4f4>] warn_slowpath_common+0x78/0xa0
> > [  177.132812] [<ffffffff8022e5b8>] warn_slowpath_fmt+0x38/0x44
> > [  177.136718] [<ffffffff80522ed8>] radeon_fence_wait+0x25c/0x33c
> > [  177.144531] [<ffffffff804e9e70>] ttm_bo_wait+0x108/0x220
> > [  177.148437] [<ffffffff8053b478>] radeon_gem_wait_idle_ioctl+0x80/0x114
> > [  177.156250] [<ffffffff804d2fe8>] drm_ioctl+0x2e4/0x3fc
> > [  177.160156] [<ffffffff805a1820>] radeon_kms_compat_ioctl+0x28/0x38
> > [  177.167968] [<ffffffff80311a04>] compat_sys_ioctl+0x120/0x35c
> > [  177.171875] [<ffffffff80211d18>] handle_sys+0x118/0x138
> > [  177.179687] ---[ end trace 92f63d998efe4c6d ]---
> > [  177.187500] radeon 0000:01:05.0: GPU softreset
> > [  177.191406] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xF57C2030
> > [  177.195312] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00111103
> > [  177.203125] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x20023040
> > [  177.363281] radeon 0000:01:05.0: Wait for MC idle timedout !
> > [  177.367187] radeon 0000:01:05.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
> > [  177.390625] radeon 0000:01:05.0: R_008020_GRBM_SOFT_RESET=0x00000001
> > [  177.414062] radeon 0000:01:05.0:   R_008010_GRBM_STATUS=0xA0003030
> > [  177.417968] radeon 0000:01:05.0:   R_008014_GRBM_STATUS2=0x00000003
> > [  177.425781] radeon 0000:01:05.0:   R_000E50_SRBM_STATUS=0x2002B040
> > [  177.433593] radeon 0000:01:05.0: GPU reset succeed
> > [  177.605468] radeon 0000:01:05.0: Wait for MC idle timedout !
> > [  177.761718] radeon 0000:01:05.0: Wait for MC idle timedout !
> > [  177.804687] radeon 0000:01:05.0: WB enabled
> > [  178.000000] [drm:r600_ring_test] *ERROR* radeon: ring test failed
> > (scratch(0x8504)=0xCAFEDEAD)
> After we pinned the ring in VRAM, the driver warned of an IB test failure.
> It seems something is wrong with accesses that go through the GTT.
> 
> We dumped the GART table just after stopping the CP and compared it with
> the dump taken just after r600_pcie_gart_enable, and we did not find any
> difference.
> 
> Any idea?
> 
> > [  178.007812] [drm:r600_resume] *ERROR* r600 startup failed on resume
> > [  178.988281] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> > IB(5).
> > [  178.996093] [drm:radeon_cs_ioctl] *ERROR* Failed to schedule IB !
> > [  179.003906] [drm:radeon_ib_schedule] *ERROR* radeon: couldn't schedule
> > IB(6).
> > ...
> 
> 

Do you have any kind of IOMMU? Is the GART table programmed with the proper
physical address for each page? Is the GPU a PCI bus master? (IIRC a PCI
device needs to be a bus master to be able to initiate requests to memory.)
Beyond that, there could be a lot of other PCI things getting in the way.
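
One way to check the bus master question from userspace after a resume is to
read the PCI command register (offset 0x04 in the standard config header) and
test bit 2, the bus master enable bit. The sketch below is only an
illustration under assumptions: the helper names are made up for this
example, and the sysfs path mentioned afterwards assumes the device address
0000:01:05.0 shown in the log above. It is not code from the radeon driver.

```c
#include <stdint.h>
#include <stdio.h>

#define PCI_COMMAND_OFFSET 0x04   /* command register offset in PCI config space */
#define PCI_COMMAND_MASTER 0x0004 /* bit 2: bus master enable */

/* Hypothetical helper: returns 1 if the bus master bit is set, else 0. */
static int pci_command_has_bus_master(uint16_t command)
{
    return (command & PCI_COMMAND_MASTER) != 0;
}

/*
 * Hypothetical helper: reads the 16-bit little-endian command register
 * from a PCI config space file (for example a sysfs "config" file).
 * Returns 0 on success and -1 on any I/O error.
 */
static int read_pci_command(const char *config_path, uint16_t *command)
{
    unsigned char buf[2];
    FILE *f = fopen(config_path, "rb");

    if (!f)
        return -1;
    if (fseek(f, PCI_COMMAND_OFFSET, SEEK_SET) != 0 ||
        fread(buf, 1, sizeof(buf), f) != sizeof(buf)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* PCI config space is little-endian regardless of host byte order. */
    *command = (uint16_t)(buf[0] | (buf[1] << 8));
    return 0;
}
```

Calling read_pci_command() on /sys/bus/pci/devices/0000:01:05.0/config before
suspend and again after resume, then passing the value to
pci_command_has_bus_master(), would show whether the bit was lost across the
suspend cycle.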

Cheers,
Jerome