[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Fri Feb 2 13:31:23 UTC 2024

Hi Joshie,

the first patch is already on the way upstream since that is a clear bug 
fix.

Sunil has setup a test system and contacted up with Friedrich to get his 
hands on the test application and reproduced the problem. It looks like 
that the OVERFLOW_CLEAR bit is only the tip of the iceberg of incorrect 
IH documentation and we are now nuking the HW engineers responsible for 
this block with questions.

I will also be pushing that we get an IGT tests for this and that we 
find a long term solution to not be surprised by incorrect hw 
documentation any more.

Thanks,
Christian.

Am 02.02.24 um 12:11 schrieb Joshua Ashton:
> Hello Christian,
>
> Any update on finding an upstreamable solution for this problem?
>
> Having working hang recovery is really important for us on Steam Deck, 
> and it would be nice to have an upstream solution, and not carry a 
> bunch of patches you disagree with. :P
>
> Thanks
> - Joshie 🐸✨
>
> On 1/23/24 12:49, Christian König wrote:
>> Am 23.01.24 um 12:35 schrieb Friedrich Vock:
>>> On 23.01.24 10:36, Christian König wrote:
>>>>
>>>>
>>>> Am 22.01.24 um 23:39 schrieb Joshua Ashton:
>>>>> [SNIP]
>>>>>>>
>>>>>>> Most work submissions in practice submit more waves than the 
>>>>>>> number of
>>>>>>> wave slots the GPU has.
>>>>>>> As far as I understand soft recovery, the only thing it does is
>>>>>>> kill all
>>>>>>> active waves. This frees up the CUs so more waves are launched, 
>>>>>>> which
>>>>>>> can fault again, and that leads to potentially lots of faults for a
>>>>>>> single wave slot in the end.
>>>>>>
>>>>>> Exactly that, but killing each wave takes a moment since we do that
>>>>>> in a loop with a bit delay in there.
>>>>>>
>>>>>> So the interrupt handler should at least in theory have time to
>>>>>> catch up.
>>>>>
>>>>> I don't think there is any delay in that loop is there?
>>>>
>>>> Mhm, looks like I remember that incorrectly.
>>>>
>>>>>
>>>>>     while (!dma_fence_is_signaled(fence) &&
>>>>>            ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>>>>         ring->funcs->soft_recovery(ring, vmid);
>>>>>
>>>>> (soft_recovery function does not have a delay/sleep/whatever either)
>>>>>
>>>>> FWIW, two other changes we did in SteamOS to make recovery more
>>>>> reliable on VANGOGH was:
>>>>>
>>>>> 1) Move the timeout determination after the spinlock setting the
>>>>> fence error.
>>>>
>>>> Well that should not really have any effect.
>>>>
>>>>>
>>>>> 2) Raise the timeout from 0.1s to 1s.
>>>>
>>>> Well that's not necessarily a good idea. If the SQ isn't able to
>>>> respond in 100ms then I would really go into a hard reset.
>>>>
>>>> Waiting one extra second is way to long here.
>>>
>>> Bumping the timeout seemed to be necessary in order to reliably
>>> soft-recover from hangs with page faults. (Being able to soft-recover
>>> from these is actually a really good thing, because if e.g. games
>>> accidentally trigger faults, it won't kill a user's entire system.)
>>
>> I still have an extremely bad feeling about that. From the 
>> discussions a wave which waits for a fault resolution can't be 
>> preempted nor killed.
>>
>> So what most likely happens is that some of the state sticks around 
>> in the hw and can only be cleared with a hard recovery.
>>
>> For the steam deck it might still be the better option but that is 
>> most likely not the best solution for every use case. It could for 
>> example be that the system doesn't have the full performance any more.
>>
>>>
>>> However, the bump I had in mind was more moderate: Currently the 
>>> timeout
>>> is 10ms (=0.01s). Bumping that to 0.1s already improves reliability
>>> enough. I agree that waiting a full second before giving up might be a
>>> bit too long.
>>
>> Well we should never have a timeout longer than we would expect a 
>> submission to be. So assuming a minimum of 10fps we should never go 
>> over 100ms or so.
>>
>> If killing the waves takes longer than the original submission would 
>> have then there is most likely some state not correctly cleared in 
>> the hw and we really have to do a hard reset to clean up.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>> Friedrich
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> - Joshie 🐸✨
>>>>>
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Friedrich
>>>>>
>>>>
>>