[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Christian König christian.koenig at amd.com
Mon Jan 22 13:35:59 UTC 2024


Am 22.01.24 um 11:45 schrieb Friedrich Vock:
> On 22.01.24 11:21, Friedrich Vock wrote:
>> On 22.01.24 11:10, Christian König wrote:
>>> Am 19.01.24 um 20:18 schrieb Felix Kuehling:
>>>> On 2024-01-18 07:07, Christian König wrote:
>>>>> Am 18.01.24 um 00:44 schrieb Friedrich Vock:
>>>>>> On 18.01.24 00:00, Alex Deucher wrote:
>>>>>> [SNIP]
>>>>>> No, amdgpu.noretry=1 does not change anything.
>>>>>
>>>>> Well the good news first the hw engineer answered rather quickly.
>>>>> The bad news is that the hardware really doesn't work as documented
>>>>> in multiple ways.
>>>>>
>>>>> First of all the CLEAR bit is a level and not a trigger, so the
>>>>> intention to clear it is indeed correct. For now please modify this
>>>>> patch so that the CLEAR bit is set and cleared directly after
>>>>> setting it, this way we should be able to detect further overflows
>>>>> immediately.
>>>>>
>>>>> Then the APU the Steam Deck uses simply doesn't have the filter
>>>>> function for page faults in the hardware, the really bad news is it
>>>>> also doesn't have the extra IH rings where we could re-route the
>>>>> faults to prevent overflows.
>>>>>
>>>>> That full explains the behavior you have been seeing, but doesn't
>>>>> really provide a doable solution to mitigate this problem.
>>>>>
>>>>> I'm going to dig deeper into the hw documentation and specification
>>>>> to see if we can use a different feature to avoid the overflow.
>>>>
>>>> If we're not enabling retry faults, then each wave front should
>>>> generate at most one fault. You should be able to avoid overflows by
>>>> making the IH ring large enough to accommodate one fault per wave
>>>> front.
>>>
>>> That is the exact same argument our HW engineers came up with when we
>>> asked why the APU is missing all those nice IH ring overflow avoidance
>>> features the dGPUs have :)
>>>
>> I can reproduce IH overflows on my RX 6700 XT dGPU as well FWIW.

Interesting data point. We have probably looked to much into the faults 
on MI* products and never checked Navi.

Can you try to just setting WPTR_OVERFLOW_ENABLE to 0? At least in 
theory that should disable IH overflows altogether on Navi without 
causing loss of IVs.

>>
>>> The only problem with this approach is that on Navi when a wave is
>>> blocked by waiting on a fault you can't kill it using soft recovery
>>> any more (at least when my understanding is correct).
>>>
>> Killing page-faulted waves via soft recovery works. From my testing on
>> Deck, it seems to take a bit of time, but if you try for long enough
>> soft recovery eventually succeeds.

Ok that is massively strange. We had tons of discussions about that 
shader can't be interrupted while they wait for a fault on Navi.

Maybe killing them is still possible, need to double check that as well.

>
>
> On second thought, could it be that this is the critical flaw in the "at
> most one fault per wave" thinking?

Well completely agree that this. That rational to leave out the new IH 
features on APUs is rather weak.

>
> Most work submissions in practice submit more waves than the number of
> wave slots the GPU has.
> As far as I understand soft recovery, the only thing it does is kill all
> active waves. This frees up the CUs so more waves are launched, which
> can fault again, and that leads to potentially lots of faults for a
> single wave slot in the end.

Exactly that, but killing each wave takes a moment since we do that in a 
loop with a bit delay in there.

So the interrupt handler should at least in theory have time to catch up.

Regards,
Christian.

>
> Regards,
> Friedrich


More information about the amd-gfx mailing list