[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr
Christian König
christian.koenig at amd.com
Tue Jan 23 09:36:39 UTC 2024
On 22.01.24 at 23:39, Joshua Ashton wrote:
> [SNIP]
>>>
>>> Most work submissions in practice submit more waves than the number of
>>> wave slots the GPU has.
>>> As far as I understand soft recovery, the only thing it does is kill all
>>> active waves. This frees up the CUs so more waves are launched, which
>>> can fault again, and that leads to potentially lots of faults for a
>>> single wave slot in the end.
>>
>> Exactly that, but killing each wave takes a moment since we do that
>> in a loop with a bit of delay in there.
>>
>> So the interrupt handler should at least in theory have time to catch
>> up.
>
> I don't think there is any delay in that loop, is there?
Mhm, looks like I remember that incorrectly.
>
> while (!dma_fence_is_signaled(fence) &&
>        ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
>         ring->funcs->soft_recovery(ring, vmid);
>
> (soft_recovery function does not have a delay/sleep/whatever either)
>
> FWIW, two other changes we did in SteamOS to make recovery more
> reliable on VANGOGH were:
>
> 1) Move the timeout determination after the spinlock setting the fence
> error.
Well that should not really have any effect.
>
> 2) Raise the timeout from 0.1s to 1s.
Well that's not necessarily a good idea. If the SQ isn't able to respond
in 100ms then I would really go into a hard reset.
Waiting one extra second is way too long here.
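
For illustration only, here is a userspace sketch of the deadline-bounded retry loop quoted above. It is not the amdgpu code: `fake_fence`, `fake_fence_is_signaled()`, `fake_soft_recovery()` and `soft_recover_until()` are hypothetical stand-ins, and `clock_gettime(CLOCK_MONOTONIC)` takes the place of `ktime_get()`. It only shows that the loop retries back to back with no delay until either the fence signals or the timeout expires, so the timeout value alone decides how long we spin before falling back to something harder.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds; userspace stand-in for ktime_get(). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical fence that signals after a fixed number of recovery attempts. */
struct fake_fence {
	unsigned long attempts_needed;
	unsigned long attempts_seen;
};

static bool fake_fence_is_signaled(const struct fake_fence *f)
{
	return f->attempts_seen >= f->attempts_needed;
}

/* Hypothetical stand-in for ring->funcs->soft_recovery(); no sleep inside. */
static void fake_soft_recovery(struct fake_fence *f)
{
	f->attempts_seen++;
}

/*
 * Retry soft recovery back to back until the fence signals or timeout_ns
 * has elapsed. Returns true if the fence signaled; false means the caller
 * would have to escalate (e.g. to a hard reset).
 */
static bool soft_recover_until(struct fake_fence *f, int64_t timeout_ns)
{
	assert(timeout_ns >= 0);

	const int64_t deadline = now_ns() + timeout_ns;

	while (!fake_fence_is_signaled(f) && deadline - now_ns() > 0)
		fake_soft_recovery(f);

	return fake_fence_is_signaled(f);
}
```

With a 100 ms budget this would be called as `soft_recover_until(&fence, 100 * 1000 * 1000LL)`; raising the budget to 1 s only changes that argument, not the shape of the loop.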
Regards,
Christian.
>
> - Joshie 🐸✨
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>> Friedrich
>