[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Mon Jan 22 10:45:31 UTC 2024

On 22.01.24 11:21, Friedrich Vock wrote:
> On 22.01.24 11:10, Christian König wrote:
>> Am 19.01.24 um 20:18 schrieb Felix Kuehling:
>>> On 2024-01-18 07:07, Christian König wrote:
>>>> Am 18.01.24 um 00:44 schrieb Friedrich Vock:
>>>>> On 18.01.24 00:00, Alex Deucher wrote:
>>>>>> [SNIP]
>>>>>>>>>> Right now, IH overflows, even if they occur repeatedly, only get
>>>>>>>>>> registered once. If not registering IH overflows can trivially
>>>>>>>>>> lead to
>>>>>>>>>> system crashes, it's amdgpu's current handling that is broken.
>>>>>>>>> It's years that we last tested this but according to the HW
>>>>>>>>> documentation this should work fine.
>>>>>>>>>
>>>>>>>>> What could potentially happen is that the IH has silenced the
>>>>>>>>> source
>>>>>>>>> of the overflow. We never implemented resetting those, but in
>>>>>>>>> this
>>>>>>>>> case that here won't help either.
>>>>>>>>>
>>>>>>>> If the IH silenced the page faults (which quite clearly cause the
>>>>>>>> overflow here), then how are the page faults still logged in
>>>>>>>> dmesg?
>>>>>>> There should be a hardware rate limit for the page faults, e.g.
>>>>>>> there
>>>>>>> can only be X faults reported in N clock cycles and then a delay is
>>>>>>> inserted.
>>>>>> @Christian Koenig  Is that tied to xnack (i.e., noretry)? The
>>>>>> default
>>>>>> is noretry=1 on gfx10.3 and newer.  But it can be overridden. It was
>>>>>> not set on some older kernels, maybe that is the problem? @Friedrich
>>>>>> Vock does setting amdgpu.noretry=1 fix the issue?
>>>>>
>>>>>
>>>>> No, amdgpu.noretry=1 does not change anything.
>>>>
>>>> Well the good news first the hw engineer answered rather quickly.
>>>> The bad news is that the hardware really doesn't work as documented
>>>> in multiple ways.
>>>>
>>>> First of all the CLEAR bit is a level and not a trigger, so the
>>>> intention to clear it is indeed correct. For now please modify this
>>>> patch so that the CLEAR bit is set and cleared directly after
>>>> setting it, this way we should be able to detect further overflows
>>>> immediately.
>>>>
>>>> Then the APU the Steam Deck uses simply doesn't have the filter
>>>> function for page faults in the hardware, the really bad news is it
>>>> also doesn't have the extra IH rings where we could re-route the
>>>> faults to prevent overflows.
>>>>
>>>> That full explains the behavior you have been seeing, but doesn't
>>>> really provide a doable solution to mitigate this problem.
>>>>
>>>> I'm going to dig deeper into the hw documentation and specification
>>>> to see if we can use a different feature to avoid the overflow.
>>>
>>> If we're not enabling retry faults, then each wave front should
>>> generate at most one fault. You should be able to avoid overflows by
>>> making the IH ring large enough to accommodate one fault per wave
>>> front.
>>
>> That is the exact same argument our HW engineers came up with when we
>> asked why the APU is missing all those nice IH ring overflow avoidance
>> features the dGPUs have :)
>>
> I can reproduce IH overflows on my RX 6700 XT dGPU as well FWIW.
>
>> The only problem with this approach is that on Navi when a wave is
>> blocked by waiting on a fault you can't kill it using soft recovery
>> any more (at least when my understanding is correct).
>>
> Killing page-faulted waves via soft recovery works. From my testing on
> Deck, it seems to take a bit of time, but if you try for long enough
> soft recovery eventually succeeds.

On second thought, could it be that this is the critical flaw in the "at
most one fault per wave" thinking?

Most work submissions in practice submit more waves than the number of
wave slots the GPU has.
As far as I understand soft recovery, the only thing it does is kill all
active waves. This frees up the CUs so more waves are launched, which
can fault again, and that leads to potentially lots of faults for a
single wave slot in the end.

Regards,
Friedrich

>
> Regards,
> Friedrich
>
>>>
>>> If the faults are coming from SDMA, that may be another problem. I'm
>>> not sure whether it can generate multiple no-retry faults from the
>>> same queue.
>>
>> Regarding faults the SDMA is relatively harmless compared to the 3D
>> engine, IIRC the resolve queue is something like 128 entries deep. So
>> you never see more than those 128 faults if I'm not completely mistaken.
>>
>> Sunil is setting up a test system for this in an AMD lab and will play
>> around with a few HW features to mitigate the issue. I still hope that
>> we can completely avoid the overflow altogether.
>>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>> Friedrich
>>>>>
>>>>>> Alex
>>>>>>
>>>>>>>>>> The possibility of a repeated IH overflow in between reading
>>>>>>>>>> the wptr
>>>>>>>>>> and updating the rptr is a good point, but how can we detect
>>>>>>>>>> that at
>>>>>>>>>> all? It seems to me like we can't set the OVERFLOW_CLEAR bit
>>>>>>>>>> at all
>>>>>>>>>> then, because we're guaranteed to miss any overflows that
>>>>>>>>>> happen while
>>>>>>>>>> the bit is set.
>>>>>>>>> When an IH overflow is signaled we clear that flag by writing 1
>>>>>>>>> into
>>>>>>>>> the OVERFLOW_CLEAR bit and skip one entry in the IH ring buffer.
>>>>>>>>>
>>>>>>>>> What can of course happen is that the IH ring buffer overflows
>>>>>>>>> more
>>>>>>>>> than this single entry and we process IVs which are potentially
>>>>>>>>> corrupted, but we won't miss any additional overflows since we
>>>>>>>>> only
>>>>>>>>> start processing after resetting the flag.
>>>>>>>>>
>>>>>>>>> An IH overflow is also something you should *never* see in a
>>>>>>>>> production system. This is purely for driver bringup and as
>>>>>>>>> fallback
>>>>>>>>> when there is a severe incorrect programming of the HW.
>>>>>>>>>
>>>>>>>>> The only exception of that is page fault handling on MI products
>>>>>>>>> because of a hardware bug, to mitigate this we are processing
>>>>>>>>> page
>>>>>>>>> faults on a separate IH ring on those parts.
>>>>>>>>>
>>>>>>>>> On all other hw generations the IH should have some rate limit
>>>>>>>>> for the
>>>>>>>>> number of faults generated per second, so that the CPU is
>>>>>>>>> always able
>>>>>>>>> to catch up.
>>>>>>>> I'm wondering if there is another bug in here somewhere. Your
>>>>>>>> explanation of how it's supposed to work makes a lot of sense,
>>>>>>>> but from
>>>>>>>> what I can tell it doesn't work that way when I test it.
>>>>>>>>
>>>>>>>>  From the printk_ratelimit stats it would seem like >2000 faults
>>>>>>>> arrive
>>>>>>>> in less than a second, so perhaps your theory about fault
>>>>>>>> interrupt
>>>>>>>> ratelimiting not working is correct (but it's hard for me to
>>>>>>>> verify what
>>>>>>>> is going on without the documentation).
>>>>>>> I'm going to ping the relevant engineer and putting someone on
>>>>>>> the task
>>>>>>> to take a look.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Friedrich
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Friedrich
>>>>>>>>>>
>>>>>>>>>>> When you clear the overflow again when updating the RPTR you
>>>>>>>>>>> could
>>>>>>>>>>> loose another overflow which might have happened in between
>>>>>>>>>>> and so
>>>>>>>>>>> potentially process corrupted IVs.
>>>>>>>>>>>
>>>>>>>>>>> That can trivially crash the system.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    static int cik_ih_early_init(void *handle)
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>>> index b8c47e0cf37a..076559668573 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>>> @@ -215,7 +215,7 @@ static u32 cz_ih_get_wptr(struct
>>>>>>>>>>>> amdgpu_device
>>>>>>>>>>>> *adev,
>>>>>>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>>> -
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>>
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>> @@ -266,7 +266,19 @@ static void cz_ih_decode_iv(struct
>>>>>>>>>>>> amdgpu_device
>>>>>>>>>>>> *adev,
>>>>>>>>>>>>    static void cz_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                   struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>> +
>>>>>>>>>>>>        WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    static int cz_ih_early_init(void *handle)
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>>> index aecad530b10a..1a5e668643d1 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>>> @@ -214,7 +214,7 @@ static u32 iceland_ih_get_wptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>>> -
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>>
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>> @@ -265,7 +265,19 @@ static void iceland_ih_decode_iv(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>    static void iceland_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                    struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>> +
>>>>>>>>>>>>        WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    static int iceland_ih_early_init(void *handle)
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>>> index d9ed7332d805..ce8f7feec713 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_0_get_wptr(struct
>>>>>>>>>>>> amdgpu_device
>>>>>>>>>>>> *adev,
>>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>> +
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>>    }
>>>>>>>>>>>> @@ -459,6 +461,7 @@ static void ih_v6_0_irq_rearm(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>    static void ih_v6_0_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>>
>>>>>>>>>>>>        if (ih->use_doorbell) {
>>>>>>>>>>>> @@ -472,6 +475,16 @@ static void ih_v6_0_set_rptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    /**
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>>> index 8fb05eae340a..668788ad34d9 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_1_get_wptr(struct
>>>>>>>>>>>> amdgpu_device
>>>>>>>>>>>> *adev,
>>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>> +
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>>    }
>>>>>>>>>>>> @@ -459,6 +461,7 @@ static void ih_v6_1_irq_rearm(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>    static void ih_v6_1_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>>
>>>>>>>>>>>>        if (ih->use_doorbell) {
>>>>>>>>>>>> @@ -472,6 +475,16 @@ static void ih_v6_1_set_rptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    /**
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>>> index e64b33115848..0bdac923cb4d 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>>> @@ -442,6 +442,7 @@ static u32 navi10_ih_get_wptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>>    }
>>>>>>>>>>>> @@ -483,6 +484,7 @@ static void navi10_ih_irq_rearm(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>    static void navi10_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>>
>>>>>>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>>>>>>> @@ -499,6 +501,16 @@ static void navi10_ih_set_rptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    /**
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>>> index 9a24f17a5750..ff35056d2b54 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>>> @@ -119,6 +119,7 @@ static u32 si_ih_get_wptr(struct
>>>>>>>>>>>> amdgpu_device
>>>>>>>>>>>> *adev,
>>>>>>>>>>>>            tmp = RREG32(IH_RB_CNTL);
>>>>>>>>>>>>            tmp |= IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
>>>>>>>>>>>>            WREG32(IH_RB_CNTL, tmp);
>>>>>>>>>>>> +        ih->overflow = true;
>>>>>>>>>>>>        }
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>>    }
>>>>>>>>>>>> @@ -147,7 +148,18 @@ static void si_ih_decode_iv(struct
>>>>>>>>>>>> amdgpu_device
>>>>>>>>>>>> *adev,
>>>>>>>>>>>>    static void si_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                   struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>> +
>>>>>>>>>>>>        WREG32(IH_RB_RPTR, ih->rptr);
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32(IH_RB_CNTL);
>>>>>>>>>>>> +        tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
>>>>>>>>>>>> +        WREG32(IH_RB_CNTL, tmp);
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    static int si_ih_early_init(void *handle)
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>>> index 917707bba7f3..6f5090d3db48 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>>> @@ -218,6 +218,7 @@ static u32 tonga_ih_get_wptr(struct
>>>>>>>>>>>> amdgpu_device
>>>>>>>>>>>> *adev,
>>>>>>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>>
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>> @@ -268,6 +269,8 @@ static void tonga_ih_decode_iv(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>    static void tonga_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                      struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>> +
>>>>>>>>>>>>        if (ih->use_doorbell) {
>>>>>>>>>>>>            /* XXX check if swapping is necessary on BE */
>>>>>>>>>>>>            *ih->rptr_cpu = ih->rptr;
>>>>>>>>>>>> @@ -275,6 +278,16 @@ static void tonga_ih_set_rptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>        } else {
>>>>>>>>>>>>            WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    static int tonga_ih_early_init(void *handle)
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>>> index d364c6dd152c..bb005924f194 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>>> @@ -372,6 +372,7 @@ static u32 vega10_ih_get_wptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>>
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>> @@ -413,6 +414,7 @@ static void vega10_ih_irq_rearm(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>    static void vega10_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>>
>>>>>>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>>>>>>> @@ -429,6 +431,16 @@ static void vega10_ih_set_rptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    /**
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>>> index ddfc6941f9d5..bb725a970697 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>>> @@ -420,6 +420,7 @@ static u32 vega20_ih_get_wptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>>
>>>>>>>>>>>>    out:
>>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>> @@ -462,6 +463,7 @@ static void vega20_ih_irq_rearm(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>    static void vega20_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>>    {
>>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>>
>>>>>>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>>>>>>> @@ -478,6 +480,16 @@ static void vega20_ih_set_rptr(struct
>>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>>        }
>>>>>>>>>>>> +
>>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>>> bit),
>>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>>> 0);
>>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>>> +    }
>>>>>>>>>>>>    }
>>>>>>>>>>>>
>>>>>>>>>>>>    /**
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.43.0
>>>>>>>>>>>>
>>>>
>>