[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Christian König christian.koenig at amd.com
Tue Jan 23 09:36:39 UTC 2024



On 22.01.24 at 23:39, Joshua Ashton wrote:
> [SNIP]
>>>
>>> Most work submissions in practice submit more waves than the number of
>>> wave slots the GPU has.
>>> As far as I understand soft recovery, the only thing it does is kill all
>>> active waves. This frees up the CUs so more waves are launched, which
>>> can fault again, and that leads to potentially lots of faults for a
>>> single wave slot in the end.
>>
>> Exactly that, but killing each wave takes a moment since we do that
>> in a loop with a bit of delay in there.
>>
>> So the interrupt handler should, at least in theory, have time to catch
>> up.
>
> I don't think there is any delay in that loop, is there?

Mhm, looks like I remember that incorrectly.

>
>     while (!dma_fence_is_signaled(fence) &&
>            ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>         ring->funcs->soft_recovery(ring, vmid);
>
> (soft_recovery function does not have a delay/sleep/whatever either)
>
> FWIW, two other changes we made in SteamOS to make recovery more
> reliable on VANGOGH were:
>
> 1) Move the timeout determination after the spinlock setting the fence
> error.

Well that should not really have any effect.

>
> 2) Raise the timeout from 0.1s to 1s.

Well, that's not necessarily a good idea. If the SQ isn't able to respond
within 100ms, then I would really go for a hard reset.

Waiting one extra second is way too long here.
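
To make the trade-off concrete, here is a rough userspace sketch of a
deadline-bounded soft recovery loop that falls back to a hard reset. It is
not the amdgpu code: struct fake_ring, the fake_* helpers,
try_soft_recovery() and the simulated clock are all hypothetical stand-ins,
and the assumption that each call retires exactly one pending wave is a
deliberate simplification.

```c
#include <stdbool.h>
#include <stdint.h>

#define NS_PER_MS 1000000ULL

/* Hypothetical stand-in for ring/hardware state; not an amdgpu structure. */
struct fake_ring {
    uint64_t now_ns;        /* simulated clock */
    uint64_t step_ns;       /* simulated cost of one soft recovery call */
    unsigned waves_pending; /* waves to kill before the fence signals */
    unsigned kills;         /* number of soft recovery attempts issued */
};

/* Assumption: the fence signals once no faulting waves are left. */
static bool fake_fence_is_signaled(const struct fake_ring *r)
{
    return r->waves_pending == 0;
}

/* Assumption: one call retires one wave and only costs step_ns of time. */
static void fake_soft_recovery(struct fake_ring *r)
{
    r->kills++;
    if (r->waves_pending)
        r->waves_pending--;
    r->now_ns += r->step_ns;
}

enum recovery_result { RECOVERED_SOFT, NEEDS_HARD_RESET };

/*
 * Same shape as the quoted loop: no sleep in the body, it just retries
 * until the fence signals or the deadline passes. If the deadline passes
 * first, the caller is expected to escalate to a hard reset.
 */
static enum recovery_result try_soft_recovery(struct fake_ring *r,
                                              uint64_t timeout_ns)
{
    uint64_t deadline = r->now_ns + timeout_ns;

    while (!fake_fence_is_signaled(r) && r->now_ns < deadline)
        fake_soft_recovery(r);

    return fake_fence_is_signaled(r) ? RECOVERED_SOFT : NEEDS_HARD_RESET;
}
```

In this toy model, with 1ms per attempt and 500 pending waves, a 100ms
budget gives up after 100 attempts and asks for a hard reset, while a 1s
budget keeps retrying for half a second before the fence signals. That is
the extra stall I would rather avoid.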

Regards,
Christian.

>
> - Joshie 🐸✨
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>> Friedrich
>


