[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Fri Feb 2 11:11:25 UTC 2024

Hello Christian,

Any update on finding an upstreamable solution for this problem?

Having working hang recovery is really important for us on Steam Deck, 
and it would be nice to have an upstream solution, and not carry a bunch 
of patches you disagree with. :P

Thanks
- Joshie 🐸✨

On 1/23/24 12:49, Christian König wrote:
> Am 23.01.24 um 12:35 schrieb Friedrich Vock:
>> On 23.01.24 10:36, Christian König wrote:
>>>
>>>
>>> Am 22.01.24 um 23:39 schrieb Joshua Ashton:
>>>> [SNIP]
>>>>>>
>>>>>> Most work submissions in practice submit more waves than the 
>>>>>> number of
>>>>>> wave slots the GPU has.
>>>>>> As far as I understand soft recovery, the only thing it does is
>>>>>> kill all
>>>>>> active waves. This frees up the CUs so more waves are launched, which
>>>>>> can fault again, and that leads to potentially lots of faults for a
>>>>>> single wave slot in the end.
>>>>>
>>>>> Exactly that, but killing each wave takes a moment since we do that
>>>>> in a loop with a bit delay in there.
>>>>>
>>>>> So the interrupt handler should at least in theory have time to
>>>>> catch up.
>>>>
>>>> I don't think there is any delay in that loop is there?
>>>
>>> Mhm, looks like I remember that incorrectly.
>>>
>>>>
>>>>     while (!dma_fence_is_signaled(fence) &&
>>>>            ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>>>         ring->funcs->soft_recovery(ring, vmid);
>>>>
>>>> (soft_recovery function does not have a delay/sleep/whatever either)
>>>>
>>>> FWIW, two other changes we did in SteamOS to make recovery more
>>>> reliable on VANGOGH was:
>>>>
>>>> 1) Move the timeout determination after the spinlock setting the
>>>> fence error.
>>>
>>> Well that should not really have any effect.
>>>
>>>>
>>>> 2) Raise the timeout from 0.1s to 1s.
>>>
>>> Well that's not necessarily a good idea. If the SQ isn't able to
>>> respond in 100ms then I would really go into a hard reset.
>>>
>>> Waiting one extra second is way to long here.
>>
>> Bumping the timeout seemed to be necessary in order to reliably
>> soft-recover from hangs with page faults. (Being able to soft-recover
>> from these is actually a really good thing, because if e.g. games
>> accidentally trigger faults, it won't kill a user's entire system.)
> 
> I still have an extremely bad feeling about that. From the discussions a 
> wave which waits for a fault resolution can't be preempted nor killed.
> 
> So what most likely happens is that some of the state sticks around in 
> the hw and can only be cleared with a hard recovery.
> 
> For the steam deck it might still be the better option but that is most 
> likely not the best solution for every use case. It could for example be 
> that the system doesn't have the full performance any more.
> 
>>
>> However, the bump I had in mind was more moderate: Currently the timeout
>> is 10ms (=0.01s). Bumping that to 0.1s already improves reliability
>> enough. I agree that waiting a full second before giving up might be a
>> bit too long.
> 
> Well we should never have a timeout longer than we would expect a 
> submission to be. So assuming a minimum of 10fps we should never go over 
> 100ms or so.
> 
> If killing the waves takes longer than the original submission would 
> have then there is most likely some state not correctly cleared in the 
> hw and we really have to do a hard reset to clean up.
> 
> Regards,
> Christian.
> 
>>
>> Regards,
>> Friedrich
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Joshie 🐸✨
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Friedrich
>>>>
>>>
>