[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

Tue Feb 28 20:49:31 PST 2012

> On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>> 在 2012年2月17日 下午5:27，Chen Jie <chenj at lemote.com> 写道：
>> >> One good way to test gart is to go over GPU gart table and write a
>> >> dword using the GPU at end of each page something like 0xCAFEDEAD
>> >> or somevalue that is unlikely to be already set. And then go over
>> >> all the page and check that GPU write succeed. Abusing the scratch
>> >> register write back feature is the easiest way to try that.
>> > I'm planning to add a GART table check procedure when resume, which
>> > will go over GPU gart table:
>> > 1. read(backup) a dword at end of each GPU page
>> > 2. write a mark by GPU and check it
>> > 3. restore the original dword
>> Attachment validateGART.patch do the job:
>> * It current only works for mips64 platform.
>> * To use it, apply all_in_vram.patch first, which will allocate CP
>> ring, ih, ib in VRAM and hard code no_wb=1.
>>
>> The gart test routine will be invoked in r600_resume. We've tried it,
>> and find that when lockup happened the gart table was good before
>> userspace restarting. The related dmesg follows:
>> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>> at 9000000040040000, 32768 entries, Dummy
>> Page[0x000000000e004000-0x000000000e007fff]
>> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>> entries(valid=8544, invalid=24224, total=32768).
>> ...
>> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>> [ 1532.152343] Restarting tasks ... done.
>> [ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than
>> 10003msec
>> [ 1544.472656] ------------[ cut here ]------------
>> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>> radeon_fence_wait+0x25c/0x314()
>> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>> 0x0002136A)
>> ...
>> [ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
>> [ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
>> [ 1545.062500] radeon 0000:01:05.0: WB disabled
>> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>> [ 1545.109375] [drm] Enabling audio support
>> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>> at 9000000040040000, 32768 entries, Dummy
>> Page[0x000000000e004000-0x000000000e007fff]
>> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>> entry=0x000000000e008067, orignal=0x745aaad1
>> ...
>> /* System blocked here. */
>>
>> Any idea?
>
> I know lockup are frustrating, my only idea is the memory controller
> is lockup because of some failing pci <-> system ram transaction.
>
>>
>> BTW, we find the following in r600_pcie_gart_enable()
>> (drivers/gpu/drm/radeon/r600.c):
>> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>> (u32)(rdev->dummy_page.addr >> 12));
>>
>> On our platform, PAGE_SIZE is 16K, does it have any problem?
>
> No this should be handled properly.
>
>> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>> should change to:
>>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>>           radeon_gart_set_page(rdev, t, page_base);
>> -         page_base += RADEON_GPU_PAGE_SIZE;
>> +         if (page_base != rdev->dummy_page.addr)
>> +                 page_base += RADEON_GPU_PAGE_SIZE;
>>   }
>> ???
>
> No need to do so, dummy page will be 16K too, so it's fine.
Really? When CPU page is 16K and GPU page is 4k, suppose the dummy page
is 0x8e004000, then there are four types of address in GART:0x8e004000,
0x8e005000, 0x8e006000, 0x8e007000. The value which written in
VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000<<12). I
don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.

>
> Cheers,
> Jerome
>
>

Huacai Chen