[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Mon Jan 22 22:39:07 UTC 2024

On 1/22/24 13:35, Christian König wrote:
> Am 22.01.24 um 11:45 schrieb Friedrich Vock:
>> On 22.01.24 11:21, Friedrich Vock wrote:
>>> On 22.01.24 11:10, Christian König wrote:
>>>> Am 19.01.24 um 20:18 schrieb Felix Kuehling:
>>>>> On 2024-01-18 07:07, Christian König wrote:
>>>>>> Am 18.01.24 um 00:44 schrieb Friedrich Vock:
>>>>>>> On 18.01.24 00:00, Alex Deucher wrote:
>>>>>>> [SNIP]
>>>>>>> No, amdgpu.noretry=1 does not change anything.
>>>>>>
>>>>>> Well the good news first the hw engineer answered rather quickly.
>>>>>> The bad news is that the hardware really doesn't work as documented
>>>>>> in multiple ways.
>>>>>>
>>>>>> First of all the CLEAR bit is a level and not a trigger, so the
>>>>>> intention to clear it is indeed correct. For now please modify this
>>>>>> patch so that the CLEAR bit is set and cleared directly after
>>>>>> setting it, this way we should be able to detect further overflows
>>>>>> immediately.
>>>>>>
>>>>>> Then the APU the Steam Deck uses simply doesn't have the filter
>>>>>> function for page faults in the hardware, the really bad news is it
>>>>>> also doesn't have the extra IH rings where we could re-route the
>>>>>> faults to prevent overflows.
>>>>>>
>>>>>> That full explains the behavior you have been seeing, but doesn't
>>>>>> really provide a doable solution to mitigate this problem.
>>>>>>
>>>>>> I'm going to dig deeper into the hw documentation and specification
>>>>>> to see if we can use a different feature to avoid the overflow.
>>>>>
>>>>> If we're not enabling retry faults, then each wave front should
>>>>> generate at most one fault. You should be able to avoid overflows by
>>>>> making the IH ring large enough to accommodate one fault per wave
>>>>> front.
>>>>
>>>> That is the exact same argument our HW engineers came up with when we
>>>> asked why the APU is missing all those nice IH ring overflow avoidance
>>>> features the dGPUs have :)
>>>>
>>> I can reproduce IH overflows on my RX 6700 XT dGPU as well FWIW.
> 
> Interesting data point. We have probably looked to much into the faults 
> on MI* products and never checked Navi.
> 
> Can you try to just setting WPTR_OVERFLOW_ENABLE to 0? At least in 
> theory that should disable IH overflows altogether on Navi without 
> causing loss of IVs.
> 
>>>
>>>> The only problem with this approach is that on Navi when a wave is
>>>> blocked by waiting on a fault you can't kill it using soft recovery
>>>> any more (at least when my understanding is correct).
>>>>
>>> Killing page-faulted waves via soft recovery works. From my testing on
>>> Deck, it seems to take a bit of time, but if you try for long enough
>>> soft recovery eventually succeeds.
> 
> Ok that is massively strange. We had tons of discussions about that 
> shader can't be interrupted while they wait for a fault on Navi.
> 
> Maybe killing them is still possible, need to double check that as well.
> 
>>
>>
>> On second thought, could it be that this is the critical flaw in the "at
>> most one fault per wave" thinking?
> 
> Well completely agree that this. That rational to leave out the new IH 
> features on APUs is rather weak.
> 
>>
>> Most work submissions in practice submit more waves than the number of
>> wave slots the GPU has.
>> As far as I understand soft recovery, the only thing it does is kill all
>> active waves. This frees up the CUs so more waves are launched, which
>> can fault again, and that leads to potentially lots of faults for a
>> single wave slot in the end.
> 
> Exactly that, but killing each wave takes a moment since we do that in a 
> loop with a bit delay in there.
> 
> So the interrupt handler should at least in theory have time to catch up.

I don't think there is any delay in that loop is there?

	while (!dma_fence_is_signaled(fence) &&
	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
		ring->funcs->soft_recovery(ring, vmid);

(soft_recovery function does not have a delay/sleep/whatever either)

FWIW, two other changes we did in SteamOS to make recovery more reliable 
on VANGOGH was:

1) Move the timeout determination after the spinlock setting the fence 
error.

2) Raise the timeout from 0.1s to 1s.

- Joshie 🐸✨

> 
> Regards,
> Christian.
> 
>>
>> Regards,
>> Friedrich