CIK hangs with kernel 3.15, bisected

Christian König deathsimple at vodafone.de
Thu May 29 09:59:26 PDT 2014


Yeah, that will work around it for now.

But the general problem is that we have a memory corruption here; we 
just didn't notice it earlier because clearing a texture or vectors 
with zero only results in random misrendering.

Only when you hit a shader or, in this case, a page table does it really 
manifest as a bad crash.
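
To illustrate why zeroed page table entries are fatal while zeroed texture
data is not, here is a minimal sketch that decodes the dump lines quoted
further down. It assumes the flag layout of the radeon driver's R600_PTE_*
defines (valid = bit 0, system = bit 1, snooped = bit 2, readable = bit 5,
writeable = bit 6), 4 KiB pages, and that the first 32-bit word of each dump
line is the low half of the entry. The helper name decode_pte is hypothetical
and not part of the driver.

```python
# Flag bits assumed from the radeon driver's R600_PTE_* defines.
FLAGS = [
    ("VALID", 1 << 0),
    ("SYSTEM", 1 << 1),
    ("SNOOPED", 1 << 2),
    ("READABLE", 1 << 5),
    ("WRITEABLE", 1 << 6),
]

# Assumption: 4 KiB pages, so the low 12 bits carry flags, not address.
ADDR_MASK = 0xFFFFFFFFFFFFF000


def decode_pte(lo, hi):
    """Join the two 32-bit words of a dump line and split address from flags."""
    pte = (hi << 32) | lo
    names = [name for name, bit in FLAGS if pte & bit]
    return pte & ADDR_MASK, names


# First entry of the dump taken before the move to GART.
addr, names = decode_pte(0x02914061, 0x00000000)
print(hex(addr), names)

# Entry as it looks after coming back: no address and no VALID bit.
addr, names = decode_pte(0x00000000, 0x00000000)
print(hex(addr), names)
```

Under these assumptions the original entries are valid, readable and
writeable mappings of consecutive pages, while the all-zero entries map
nothing at all, so any access through them fails.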

Going to dig deeper into this,
Christian.

On 29.05.2014 18:51, Marek Olšák wrote:
> Can we disable evictions for page tables, e.g. by removing them from the LRU list?
>
> Marek
>
> On Thu, May 29, 2014 at 6:30 PM, Christian König
> <deathsimple at vodafone.de> wrote:
>> Hi Marek & Alex,
>>
>> I've found the reason why forcefully evicting page tables sometimes crashes
>> the box.
>>
>> Well, this is a typical hexdump of a page table before it is moved to GART:
>> 000117f000  02914061 00000000
>> 000117f008  02915061 00000000
>> 000117f010  02916061 00000000
>> 000117f018  02917061 00000000
>> 000117f020  02918061 00000000
>>
>> And it looks like this when it comes back:
>> 0006102000  00000000 00000000
>> *
>>
>> Ideas? I don't really have an explanation for this. Moving buffers around
>> otherwise seems to work perfectly fine.
>>
>> Thanks,
>> Christian.
>>
>> On 28.05.2014 12:38, Christian König wrote:
>>
>>> I already tried a similar patch as well, without any more noticeable
>>> crashes. But going to give this another round with your patch and openarena.
>>>
>>> Thanks,
>>> Christian.
>>>
>>> On 27.05.2014 23:55, Marek Olšák wrote:
>>>> Hi Christian,
>>>>
>>>> I test on Bonaire (ChipID = 0x665c). Unfortunately, the hangs are not
>>>> fixed yet. They are very rare and very random. Therefore, I have come
>>>> up with a patch which evicts page tables between IBs. See the
>>>> attachment. With that patch applied, the system starts fine, compiz
>>>> and glxgears work, but once I start playing openarena, it locks up
>>>> pretty quickly.
>>>>
>>>> The patch shouldn't do anything in theory, because pages are moved
>>>> back to VRAM immediately after that. However, the VRAM address of page
>>>> tables may end up being different from before, which might be the root
>>>> cause.
>>>>
>>>> Marek
>>>>
>>>> On Wed, May 14, 2014 at 2:11 PM, Christian König
>>>> <deathsimple at vodafone.de> wrote:
>>>>> Crap, any chance you can narrow it down a bit more?
>>>>>
>>>>> I've just tried a piglit quick test on my Bonaire and it seems to work
>>>>> perfectly fine.
>>>>>
>>>>> What hw do you test on?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 13.05.2014 23:21, Marek Olšák wrote:
>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> Even though some regressions are fixed by these patches:
>>>>>>
>>>>>> drm/radeon: fix page directory update size estimation
>>>>>> drm/radeon: fix buffer placement under memory pressure v2
>>>>>>
>>>>>> and indeed, the texelFetch tests no longer hang, there is one more
>>>>>> hang which needs to be fixed. :( All I know is the exact same commit
>>>>>> causes it and it can only be reproduced by running whole piglit with
>>>>>> concurrency enabled.
>>>>>>
>>>>>> My kernel git log:
>>>>>>
>>>>>> * 2ba22c8 - drm/radeon: fix buffer placement under memory pressure v2
>>>>>> (10 hours ago) <Christian König>
>>>>>> * 3af91e5 - drm/radeon: fix page directory update size estimation (21
>>>>>> hours ago) <Christian König>
>>>>>> * 6d2f294 - drm/radeon: use normal BOs for the page tables v4 (2
>>>>>> months ago) <Christian König>
>>>>>> * fa68834 - drm/radeon: further cleanup vm flushing & fencing (2
>>>>>> months ago) <Christian König>
>>>>>>
>>>>>> fa68834 doesn't hang, but 2ba22c8 hangs, which means 6d2f294 or either
>>>>>> of the two fixes is the first bad commit.
>>>>>>
>>>>>> Marek
>>>>>>
>>>>>> On Fri, May 9, 2014 at 8:03 PM, Marek Olšák <maraeo at gmail.com> wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> This commit, which first appeared in 3.15-rc1, causes hangs on Bonaire:
>>>>>>>
>>>>>>> commit 6d2f2944e95e504a7d33385eeeb9bb7fcca72592
>>>>>>> Author: Christian König <christian.koenig at amd.com>
>>>>>>> Date:   Thu Feb 20 13:42:17 2014 +0100
>>>>>>>
>>>>>>>        drm/radeon: use normal BOs for the page tables v4
>>>>>>>
>>>>>>>        No need to make it more complicated than necessary,
>>>>>>>        just allocate the page tables as normal BO and
>>>>>>>        flush whenever the address change.
>>>>>>>
>>>>>>>        v2: update comments and function name
>>>>>>>        v3: squash bug fixes, page directory and tables patch
>>>>>>>        v4: rebased on Mareks changes
>>>>>>>
>>>>>>>        Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>>>>
>>>>>>>
>>>>>>> Reverting the commit gives me a lot of merge conflicts.
>>>>>>>
>>>>>>> The simplest way to reproduce the hangs is to run piglit with these
>>>>>>> parameters:
>>>>>>> -t texelFetch.fs
>>>>>>>
>>>>>>> Some of the tests allocate a lot of MSAA textures and the tests also
>>>>>>> run in parallel, which creates a lot of memory pressure and probably
>>>>>>> causes buffer evictions.
>>>>>>>
>>>>>>> Any idea what is wrong with it?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Marek
>>>>>


