[mipsel+rs780e]Occasionally "GPU lockup" after resuming from suspend.

Thu Mar 1 09:19:43 PST 2012

2012/3/1  <chenhc at lemote.com>:
> Status update:
> In r600.c I found for RS780, num_*_threads are like this:
>            sq_thread_resource_mgmt = (NUM_PS_THREADS(79) |
>                                       NUM_VS_THREADS(78) |
>                                       NUM_GS_THREADS(4) |
>                                       NUM_ES_THREADS(31));
>
> But in the documents, each of them should be a multiple of 4. And in
> r600_blit_kms.c, they are 136, 48, 4, 4. I want to know why
> 79, 78, 4 and 31 are used here.

You can try changing them, but I don't think it will make a difference.

Alex
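
The divisibility question in the quoted message can be checked mechanically. The following is a minimal userspace sketch, not driver code: the struct and helper names are hypothetical, and the only inputs are the two sets of values quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the four SQ thread counts quoted above. */
struct sq_threads {
    unsigned ps, vs, gs, es;
};

/* Values quoted from r600.c for RS780. */
static const struct sq_threads rs780_r600 = { 79, 78, 4, 31 };
/* Values quoted from r600_blit_kms.c. */
static const struct sq_threads blit_kms = { 136, 48, 4, 4 };

/* True when every count is a multiple of 4, which is what the
 * documents are said to require. */
static bool all_multiple_of_4(const struct sq_threads *t)
{
    assert(t != 0);
    return (t->ps % 4 == 0) && (t->vs % 4 == 0) &&
           (t->gs % 4 == 0) && (t->es % 4 == 0);
}

/* Sum of the four counts, for comparing the two configurations. */
static unsigned total_threads(const struct sq_threads *t)
{
    assert(t != 0);
    return t->ps + t->vs + t->gs + t->es;
}
```

Only the r600_blit_kms.c values pass all_multiple_of_4(). total_threads() gives 192 for both sets, so the two configurations divide the same total differently. Whether that total is significant to the hardware is not stated in this thread.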

>
> Huacai Chen
>
>> On Wed, 2012-02-29 at 12:49 +0800, chenhc at lemote.com wrote:
>>> > On Tue, 2012-02-21 at 18:37 +0800, Chen Jie wrote:
>>> >> On February 17, 2012 at 5:27 PM, Chen Jie <chenj at lemote.com> wrote:
>>> >> >> One good way to test the gart is to go over the GPU gart table and
>>> >> >> write a dword using the GPU at the end of each page, something like
>>> >> >> 0xCAFEDEAD or some value that is unlikely to be already set, and then
>>> >> >> go over all the pages and check that the GPU write succeeded. Abusing
>>> >> >> the scratch register write back feature is the easiest way to try that.
>>> >> > I'm planning to add a GART table check procedure on resume, which
>>> >> > will go over the GPU gart table:
>>> >> > 1. read (back up) a dword at the end of each GPU page
>>> >> > 2. write a marker using the GPU and check it
>>> >> > 3. restore the original dword
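
The three steps above can be sketched as a userspace simulation. This is not the attached patch: gpu_write_dword() and validate_pages() are hypothetical names, the "GPU write" is modelled as a plain CPU store, and the marker is the 0xDEADBEEF value that appears in the dmesg output further down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPU_PAGE_SIZE   4096u   /* 4K GPU page, as discussed in the thread */
#define DWORDS_PER_PAGE (GPU_PAGE_SIZE / sizeof(uint32_t))
#define MARKER          0xDEADBEEFu

/* Hypothetical stand-in for "write a dword using the GPU".
 * Here it is only a CPU store; a real test would have the GPU
 * perform the write, e.g. via scratch register write back. */
static void gpu_write_dword(uint32_t *addr, uint32_t value)
{
    *addr = value;
}

/* Walk every page: back up the last dword, write the marker,
 * check it, then restore the original value.
 * Returns the number of pages where the check failed. */
static size_t validate_pages(uint32_t *mem, size_t npages)
{
    size_t bad = 0;

    assert(mem != NULL);
    for (size_t i = 0; i < npages; i++) {
        uint32_t *last = &mem[(i + 1) * DWORDS_PER_PAGE - 1];
        uint32_t saved = *last;          /* 1. back up */

        gpu_write_dword(last, MARKER);   /* 2. write the marker ... */
        if (*last != MARKER)             /*    ... and check it */
            bad++;

        *last = saved;                   /* 3. restore */
    }
    return bad;
}
```

In the simulation every page passes, since the store and the read go through the same CPU path. On real hardware the interesting case is a nonzero return, meaning the GPU write did not land in the page the GART entry points to.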
>>> >> The attached validateGART.patch does the job:
>>> >> * It currently only works on the mips64 platform.
>>> >> * To use it, apply all_in_vram.patch first, which will allocate the CP
>>> >> ring, ih and ib in VRAM and hard-code no_wb=1.
>>> >>
>>> >> The gart test routine is invoked in r600_resume. We've tried it and
>>> >> found that, when the lockup happened, the gart table was still good
>>> >> before userspace was restarted. The related dmesg follows:
>>> >> [ 1521.820312] [drm] r600_gart_table_validate(): Validate GART Table
>>> >> at 9000000040040000, 32768 entries, Dummy
>>> >> Page[0x000000000e004000-0x000000000e007fff]
>>> >> [ 1522.019531] [drm] r600_gart_table_validate(): Sweep 32768
>>> >> entries(valid=8544, invalid=24224, total=32768).
>>> >> ...
>>> >> [ 1531.156250] PM: resume of devices complete after 9396.588 msecs
>>> >> [ 1532.152343] Restarting tasks ... done.
>>> >> [ 1544.468750] radeon 0000:01:05.0: GPU lockup CP stall for more than
>>> >> 10003msec
>>> >> [ 1544.472656] ------------[ cut here ]------------
>>> >> [ 1544.480468] WARNING: at drivers/gpu/drm/radeon/radeon_fence.c:243
>>> >> radeon_fence_wait+0x25c/0x314()
>>> >> [ 1544.488281] GPU lockup (waiting for 0x0002136B last fence id
>>> >> 0x0002136A)
>>> >> ...
>>> >> [ 1544.886718] radeon 0000:01:05.0: Wait for MC idle timedout !
>>> >> [ 1545.046875] radeon 0000:01:05.0: Wait for MC idle timedout !
>>> >> [ 1545.062500] radeon 0000:01:05.0: WB disabled
>>> >> [ 1545.097656] [drm] ring test succeeded in 0 usecs
>>> >> [ 1545.105468] [drm] ib test succeeded in 0 usecs
>>> >> [ 1545.109375] [drm] Enabling audio support
>>> >> [ 1545.113281] [drm] r600_gart_table_validate(): Validate GART Table
>>> >> at 9000000040040000, 32768 entries, Dummy
>>> >> Page[0x000000000e004000-0x000000000e007fff]
>>> >> [ 1545.125000] [drm:r600_gart_table_validate] *ERROR* Iter=0:
>>> >> unexpected value 0x745aaad1(expect 0xDEADBEEF)
>>> >> entry=0x000000000e008067, orignal=0x745aaad1
>>> >> ...
>>> >> /* System blocked here. */
>>> >>
>>> >> Any idea?
>>> >
>>> > I know lockups are frustrating. My only idea is that the memory
>>> > controller locks up because of some failing pci <-> system ram transaction.
>>> >
>>> >>
>>> >> BTW, we find the following in r600_pcie_gart_enable()
>>> >> (drivers/gpu/drm/radeon/r600.c):
>>> >> WREG32(VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR,
>>> >> (u32)(rdev->dummy_page.addr >> 12));
>>> >>
>>> >> On our platform, PAGE_SIZE is 16K. Does this cause any problem?
>>> >
>>> > No, this should be handled properly.
>>> >
>>> >> Also in radeon_gart_unbind() and radeon_gart_restore(), the logic
>>> >> should change to:
>>> >>   for (j = 0; j < (PAGE_SIZE / RADEON_GPU_PAGE_SIZE); j++, t++) {
>>> >>           radeon_gart_set_page(rdev, t, page_base);
>>> >> -         page_base += RADEON_GPU_PAGE_SIZE;
>>> >> +         if (page_base != rdev->dummy_page.addr)
>>> >> +                 page_base += RADEON_GPU_PAGE_SIZE;
>>> >>   }
>>> >> ???
>>> >
>>> > No need to do so; the dummy page will be 16K too, so it's fine.
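
To make the two variants of the loop concrete, here is a small userspace sketch, assuming a 16K CPU page and a 4K GPU page as described in the thread. fill_entries() is a hypothetical model of the loop body: instead of calling radeon_gart_set_page() it records the address each entry would receive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPU_PAGE_SIZE 0x4000u   /* 16K PAGE_SIZE on the platform in question */
#define GPU_PAGE_SIZE 0x1000u   /* 4K RADEON_GPU_PAGE_SIZE */
#define ENTRIES_PER_CPU_PAGE (CPU_PAGE_SIZE / GPU_PAGE_SIZE)

/* Hypothetical model of the loop: fills out[] with the address that
 * each GART entry covering one CPU page would be given.
 * pin_dummy=false models the existing code; pin_dummy=true models the
 * proposed change, which stops advancing when page_base is the dummy
 * page address. */
static void fill_entries(uint64_t page_base, uint64_t dummy_addr,
                         bool pin_dummy, uint64_t out[ENTRIES_PER_CPU_PAGE])
{
    assert(out != 0);
    for (unsigned j = 0; j < ENTRIES_PER_CPU_PAGE; j++) {
        out[j] = page_base;
        if (!pin_dummy || page_base != dummy_addr)
            page_base += GPU_PAGE_SIZE;
    }
}
```

With the dummy page at 0x0e004000 (the address in the dmesg output above), the existing loop yields 0x0e004000, 0x0e005000, 0x0e006000 and 0x0e007000, all of which lie inside the 16K dummy page. The proposed variant yields 0x0e004000 four times. For any other page both variants behave the same.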
>>> Really? When the CPU page is 16K and the GPU page is 4K, suppose the dummy
>>> page is 0x8e004000; then there are four addresses in the GART: 0x8e004000,
>>> 0x8e005000, 0x8e006000 and 0x8e007000. The value written to
>>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is 0x8e004 (0x8e004000>>12). I
>>> don't know how VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR works, but I
>>> think 0x8e005000, 0x8e006000 and 0x8e007000 cannot be handled correctly.
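
The address arithmetic in this question can be written out as below. This is only a sketch of the shift used in r600_pcie_gart_enable(); the helper names are hypothetical, and how the hardware interprets the register is not described in this thread.

```c
#include <stdint.h>

#define GPU_PAGE_SHIFT 12   /* 4K GPU pages, matching the ">> 12" in r600.c */

/* 4K page frame number derived from a byte address, i.e. the value
 * that the quoted WREG32() call computes from dummy_page.addr. */
static uint32_t fault_default_pfn(uint64_t dummy_addr)
{
    return (uint32_t)(dummy_addr >> GPU_PAGE_SHIFT);
}

/* First byte address covered by a given 4K page frame number. */
static uint64_t pfn_to_addr(uint32_t pfn)
{
    return (uint64_t)pfn << GPU_PAGE_SHIFT;
}
```

For a dummy page at 0x8e004000, fault_default_pfn() returns 0x8e004, which names only the first 4K of the 16K page. The other three addresses, 0x8e005000, 0x8e006000 and 0x8e007000, map to frame numbers 0x8e005, 0x8e006 and 0x8e007.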
>>
>> When radeon_gart_unbind initializes the gart entries to point to the dummy
>> page, it's just to have something safe in the GART table.
>>
>> VM_CONTEXT0_PROTECTION_FAULT_DEFAULT_ADDR is the page address used when
>> a fault happens. It's like a sandbox for the mc. It doesn't conflict
>> in any way to have gart table entries point to the same page.
>>
>> Cheers,
>> Jerome
>>
>>
>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel