[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Friedrich Vock friedrich.vock at gmx.de
Tue Jan 23 11:35:03 UTC 2024


On 23.01.24 10:36, Christian König wrote:
>
>
> Am 22.01.24 um 23:39 schrieb Joshua Ashton:
>> [SNIP]
>>>>
>>>> Most work submissions in practice submit more waves than the number of
>>>> wave slots the GPU has.
>>>> As far as I understand soft recovery, the only thing it does is kill all
>>>> active waves. This frees up the CUs so more waves are launched, which
>>>> can fault again, and that leads to potentially lots of faults for a
>>>> single wave slot in the end.
>>>
>>> Exactly that, but killing each wave takes a moment since we do that
>>> in a loop with a bit of delay in there.
>>>
>>> So the interrupt handler should at least in theory have time to
>>> catch up.
>>
>> I don't think there is any delay in that loop, is there?
>
> Mhm, looks like I remember that incorrectly.
>
>>
>>     while (!dma_fence_is_signaled(fence) &&
>>            ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>>         ring->funcs->soft_recovery(ring, vmid);
>>
>> (soft_recovery function does not have a delay/sleep/whatever either)
>>
>> FWIW, two other changes we did in SteamOS to make recovery more
>> reliable on VANGOGH were:
>>
>> 1) Move the timeout determination after the spinlock setting the
>> fence error.
>
> Well that should not really have any effect.
>
>>
>> 2) Raise the timeout from 0.1s to 1s.
>
> Well that's not necessarily a good idea. If the SQ isn't able to
> respond in 100ms then I would really go into a hard reset.
>
> Waiting one extra second is way too long here.

Bumping the timeout seemed to be necessary in order to reliably
soft-recover from hangs with page faults. (Being able to soft-recover
from these is actually a really good thing, because if e.g. games
accidentally trigger faults, it won't kill a user's entire system.)

However, the bump I had in mind was more moderate: Currently the timeout
is 10ms (=0.01s). Bumping that to 0.1s already improves reliability
enough. I agree that waiting a full second before giving up might be a
bit too long.
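
To illustrate why the timeout value matters, here is a minimal user-space
sketch of the deadline loop quoted above. It is not the amdgpu code:
fake_fence, fake_soft_recovery() and try_soft_recovery() are hypothetical
stand-ins, and the "fence" simply signals after a fixed number of kill
attempts. The only thing it is meant to show is that the loop busy-polls
without any sleep until the fence signals or the deadline passes, so the
timeout is the only bound on how long soft recovery keeps trying.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a fence: signals after kills_needed kills. */
struct fake_fence {
    int  kills_needed;   /* 0 means "never signals" in this sketch */
    bool signaled;
};

/* Monotonic time in nanoseconds (user-space analogue of ktime_get()). */
static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical stand-in for one soft_recovery() call: one kill attempt. */
static void fake_soft_recovery(struct fake_fence *f)
{
    if (f->kills_needed > 0 && --f->kills_needed == 0)
        f->signaled = true;
}

/*
 * Same shape as the quoted loop: retry with no delay until the fence
 * signals or the deadline has passed. Returns true if recovery worked.
 */
static bool try_soft_recovery(struct fake_fence *f, int64_t timeout_ns)
{
    int64_t deadline = now_ns() + timeout_ns;

    while (!f->signaled && deadline - now_ns() > 0)
        fake_soft_recovery(f);

    return f->signaled;
}
```

With this shape, a fence that never signals costs the full timeout before
the caller can fall back to a hard reset, which is the trade-off between
10ms, 0.1s and 1s discussed here.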

Regards,
Friedrich

>
> Regards,
> Christian.
>
>>
>> - Joshie 🐸✨
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>> Friedrich
>>
>