[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Wed Jan 17 23:44:53 UTC 2024

On 18.01.24 00:00, Alex Deucher wrote:
> On Wed, Jan 17, 2024 at 7:36 AM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> Am 16.01.24 um 11:31 schrieb Friedrich Vock:
>>> On 16.01.24 08:03, Christian König wrote:
>>>> Am 15.01.24 um 12:18 schrieb Friedrich Vock:
>>>>> [SNIP]
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>> +        tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>> Well that is an extremely bad idea. We already reset the overflow
>>>>>> after reading the WPTR.
>>>>> This is not resetting the overflow bit. This is resetting a "clear
>>>>> overflow" bit. I don't have the hardware docs, but the name (and my
>>>>> observations) strongly suggest that setting this bit actually prevents
>>>>> the hardware from setting the overflow bit ever again.
>>>> Well that doesn't make any sense at all. The hardware documentation
>>>> clearly states that this bit is write only and should always read as
>>>> zero.
>>>>
>>>> Setting this bit will clear the overflow flag in the WPTR register and
>>>> clearing it has no effect at all.
>>>>
>>>> I could only ping the hw engineer responsible for this block to double
>>>> check if the documentation is somehow outdated, but I really doubt so.
>>>>
>>> I see. I wish I had access to the documentation,
>> Well, doesn't Valve have an NDA in place?
>>
>>> but I don't, so all I
>>> can do is tell you what I observe the hardware doing. I've tested this
>>> on both a Steam Deck (OSSYS 5.2.0) and an RX 6700 XT (OSSYS 5.0.3). On
>>> both systems, launching a bunch of shaders that cause page faults leads
>>> to lots of "[gfxhub] page fault" messages in dmesg, followed by an
>>> "amdgpu: IH ring buffer overflow".
>> Well that is certainly a bug, maybe even the same thing we have seen on
>> Vega and MI.
>>
>> What we could do is to try to apply the same workaround to re-route the
>> page faults to a different IH ring.
>>
>> See those patches here as well:
>>
>> commit 516bc3d8dd7965f1a8a3ea453857f14d95971e62
>> Author: Christian König <christian.koenig at amd.com>
>> Date:   Fri Nov 2 15:00:16 2018 +0100
>>
>>       drm/amdgpu: reroute VMC and UMD to IH ring 1
>>
>>       Page faults can easily overwhelm the interrupt handler.
>>
>>       So to make sure that we never lose valuable interrupts on the
>>       primary ring we re-route page faults to IH ring 1.
>>
>> commit b849aaa41c914a0fd88003f88cb04420a873c624
>> Author: Christian König <christian.koenig at amd.com>
>> Date:   Mon Mar 4 19:34:34 2019 +0100
>>
>>       drm/amdgpu: also reroute VMC and UMD to IH ring 1 on Vega 20
>>
>>       Same patch we already did for Vega10. Just re-route page faults to a
>>       separate ring to avoid drowning in interrupts.
>>
>>> If I re-launch the same set of shaders after the GPU has soft-recovered,
>>> the "amdgpu: IH ring buffer overflow" message is missing, even though
>>> the same amount of page faults should've been triggered at roughly the
>>> same rate. Running with this patch applied makes more "amdgpu: IH ring
>>> buffer overflow" messages appear after relaunching the faulting shaders
>>> (but not when processing any non-faulting work).
>> That is actually the expected behavior. There should be a limit on the
>> number of faults written to the ring so that the ring never overflows.
>>
>>> The only possible conclusion I can draw from this is that clearing that
>>> bit *does* have an effect, and I don't think it's far-fetched to assume
>>> the IH ring buffer overflows still happen after re-launching the
>>> faulting shaders but go undetected so far.
>> Well that can only mean that the hw documentation is incorrect.
>>
>> Either the value is not a write-only trigger bit as documented or we need
>> an additional read of the register for it to take effect or something
>> like this.
>>
>>>>> Right now, IH overflows, even if they occur repeatedly, only get
>>>>> registered once. If not registering IH overflows can trivially lead to
>>>>> system crashes, it's amdgpu's current handling that is broken.
>>>> It's been years since we last tested this but according to the HW
>>>> documentation this should work fine.
>>>>
>>>> What could potentially happen is that the IH has silenced the source
>>>> of the overflow. We never implemented resetting those, but in that
>>>> case the change here won't help either.
>>>>
>>> If the IH silenced the page faults (which quite clearly cause the
>>> overflow here), then how are the page faults still logged in dmesg?
>> There should be a hardware rate limit for the page faults, e.g. there
>> can only be X faults reported in N clock cycles and then a delay is
>> inserted.
> @Christian Koenig  Is that tied to xnack (i.e., noretry)?  The default
> is noretry=1 on gfx10.3 and newer.  But it can be overridden.  It was
> not set on some older kernels, maybe that is the problem?  @Friedrich
> Vock does setting amdgpu.noretry=1 fix the issue?

No, amdgpu.noretry=1 does not change anything.

Regards,
Friedrich

> Alex
>
>>>>> The possibility of a repeated IH overflow in between reading the wptr
>>>>> and updating the rptr is a good point, but how can we detect that at
>>>>> all? It seems to me like we can't set the OVERFLOW_CLEAR bit at all
>>>>> then, because we're guaranteed to miss any overflows that happen while
>>>>> the bit is set.
>>>> When an IH overflow is signaled we clear that flag by writing 1 into
>>>> the OVERFLOW_CLEAR bit and skip one entry in the IH ring buffer.
>>>>
>>>> What can of course happen is that the IH ring buffer overflows more
>>>> than this single entry and we process IVs which are potentially
>>>> corrupted, but we won't miss any additional overflows since we only
>>>> start processing after resetting the flag.
>>>>
>>>> An IH overflow is also something you should *never* see in a
>>>> production system. This is purely for driver bringup and as fallback
>>>> when there is a severe incorrect programming of the HW.
>>>>
>>>> The only exception to that is page fault handling on MI products
>>>> because of a hardware bug, to mitigate this we are processing page
>>>> faults on a separate IH ring on those parts.
>>>>
>>>> On all other hw generations the IH should have some rate limit for the
>>>> number of faults generated per second, so that the CPU is always able
>>>> to catch up.
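
The recovery step described above (clear the flag by writing 1 to
OVERFLOW_CLEAR, then skip one ring entry) can be sketched roughly as
follows. This is only a minimal user-space model, not driver code:
struct ih_model, ih_model_init(), ih_model_get_wptr(), ENTRY_BYTES and
OVERFLOW_BIT are hypothetical names and values chosen for illustration,
and the register write is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_BYTES  16u   /* assumed size of one IV entry, illustrative */
#define OVERFLOW_BIT 0x1u  /* assumed position of the overflow flag in the raw wptr */

/* Hypothetical stand-in for the driver's ring state. */
struct ih_model {
    uint32_t rptr;          /* read offset in bytes */
    uint32_t ptr_mask;      /* ring size in bytes minus one */
    unsigned clear_writes;  /* how often OVERFLOW_CLEAR would have been written */
};

static void ih_model_init(struct ih_model *ih, uint32_t ring_bytes)
{
    /* Masking only works for power-of-two ring sizes. */
    assert(ring_bytes && (ring_bytes & (ring_bytes - 1)) == 0);
    ih->rptr = 0;
    ih->ptr_mask = ring_bytes - 1;
    ih->clear_writes = 0;
}

/* Returns the masked write pointer. If the overflow flag is set, rptr is
 * moved one entry past wptr (the oldest entry that may have been
 * overwritten is skipped) and a write of OVERFLOW_CLEAR is recorded. */
static uint32_t ih_model_get_wptr(struct ih_model *ih, uint32_t raw_wptr)
{
    if (raw_wptr & OVERFLOW_BIT) {
        raw_wptr &= ~OVERFLOW_BIT;
        ih->rptr = (raw_wptr + ENTRY_BYTES) & ih->ptr_mask;
        ih->clear_writes++;
    }
    return raw_wptr & ih->ptr_mask;
}
```

In this model, processing only starts from the adjusted rptr after the
flag has been cleared, which is why a later overflow would be flagged
again if the clear really is a one-shot trigger.
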
>>> I'm wondering if there is another bug in here somewhere. Your
>>> explanation of how it's supposed to work makes a lot of sense, but from
>>> what I can tell it doesn't work that way when I test it.
>>>
>>> From the printk_ratelimit stats it would seem like >2000 faults arrive
>>> in less than a second, so perhaps your theory about fault interrupt
>>> ratelimiting not working is correct (but it's hard for me to verify what
>>> is going on without the documentation).
>> I'm going to ping the relevant engineer and put someone on the task
>> to take a look.
>>
>> Thanks,
>> Christian.
>>
>>> Regards,
>>> Friedrich
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Regards,
>>>>> Friedrich
>>>>>
>>>>>> When you clear the overflow again when updating the RPTR you could
>>>>>> lose another overflow which might have happened in between and so
>>>>>> potentially process corrupted IVs.
>>>>>>
>>>>>> That can trivially crash the system.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>    }
>>>>>>>
>>>>>>>    static int cik_ih_early_init(void *handle)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>> index b8c47e0cf37a..076559668573 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>> @@ -215,7 +215,7 @@ static u32 cz_ih_get_wptr(struct amdgpu_device
>>>>>>> *adev,
>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>> -
>>>>>>> +    ih->overflow = true;
>>>>>>>
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>> @@ -266,7 +266,19 @@ static void cz_ih_decode_iv(struct amdgpu_device
>>>>>>> *adev,
>>>>>>>    static void cz_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>                   struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>> +
>>>>>>>        WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    static int cz_ih_early_init(void *handle)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>> index aecad530b10a..1a5e668643d1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>> @@ -214,7 +214,7 @@ static u32 iceland_ih_get_wptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>> -
>>>>>>> +    ih->overflow = true;
>>>>>>>
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>> @@ -265,7 +265,19 @@ static void iceland_ih_decode_iv(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>    static void iceland_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>                    struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>> +
>>>>>>>        WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    static int iceland_ih_early_init(void *handle)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>> index d9ed7332d805..ce8f7feec713 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_0_get_wptr(struct amdgpu_device
>>>>>>> *adev,
>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>> +    ih->overflow = true;
>>>>>>> +
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>    }
>>>>>>> @@ -459,6 +461,7 @@ static void ih_v6_0_irq_rearm(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>    static void ih_v6_0_set_rptr(struct amdgpu_device *adev,
>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>
>>>>>>>        if (ih->use_doorbell) {
>>>>>>> @@ -472,6 +475,16 @@ static void ih_v6_0_set_rptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>        }
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    /**
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>> index 8fb05eae340a..668788ad34d9 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_1_get_wptr(struct amdgpu_device
>>>>>>> *adev,
>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>> +    ih->overflow = true;
>>>>>>> +
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>    }
>>>>>>> @@ -459,6 +461,7 @@ static void ih_v6_1_irq_rearm(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>    static void ih_v6_1_set_rptr(struct amdgpu_device *adev,
>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>
>>>>>>>        if (ih->use_doorbell) {
>>>>>>> @@ -472,6 +475,16 @@ static void ih_v6_1_set_rptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>        }
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    /**
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>> index e64b33115848..0bdac923cb4d 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>> @@ -442,6 +442,7 @@ static u32 navi10_ih_get_wptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>> +    ih->overflow = true;
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>    }
>>>>>>> @@ -483,6 +484,7 @@ static void navi10_ih_irq_rearm(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>    static void navi10_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>
>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>> @@ -499,6 +501,16 @@ static void navi10_ih_set_rptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>        }
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    /**
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>> index 9a24f17a5750..ff35056d2b54 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>> @@ -119,6 +119,7 @@ static u32 si_ih_get_wptr(struct amdgpu_device
>>>>>>> *adev,
>>>>>>>            tmp = RREG32(IH_RB_CNTL);
>>>>>>>            tmp |= IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
>>>>>>>            WREG32(IH_RB_CNTL, tmp);
>>>>>>> +        ih->overflow = true;
>>>>>>>        }
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>    }
>>>>>>> @@ -147,7 +148,19 @@ static void si_ih_decode_iv(struct amdgpu_device
>>>>>>> *adev,
>>>>>>>    static void si_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>                   struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>> +
>>>>>>>        WREG32(IH_RB_RPTR, ih->rptr);
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32(IH_RB_CNTL);
>>>>>>> +        tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
>>>>>>> +        WREG32(IH_RB_CNTL, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    static int si_ih_early_init(void *handle)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>> index 917707bba7f3..6f5090d3db48 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>> @@ -218,6 +218,7 @@ static u32 tonga_ih_get_wptr(struct amdgpu_device
>>>>>>> *adev,
>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>> +    ih->overflow = true;
>>>>>>>
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>> @@ -268,6 +269,8 @@ static void tonga_ih_decode_iv(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>    static void tonga_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>                      struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>> +
>>>>>>>        if (ih->use_doorbell) {
>>>>>>>            /* XXX check if swapping is necessary on BE */
>>>>>>>            *ih->rptr_cpu = ih->rptr;
>>>>>>> @@ -275,6 +278,16 @@ static void tonga_ih_set_rptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>        } else {
>>>>>>>            WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>>        }
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    static int tonga_ih_early_init(void *handle)
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>> index d364c6dd152c..bb005924f194 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>> @@ -372,6 +372,7 @@ static u32 vega10_ih_get_wptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>> +    ih->overflow = true;
>>>>>>>
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>> @@ -413,6 +414,7 @@ static void vega10_ih_irq_rearm(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>    static void vega10_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>
>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>> @@ -429,6 +431,16 @@ static void vega10_ih_set_rptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>        }
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    /**
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>> index ddfc6941f9d5..bb725a970697 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>> @@ -420,6 +420,7 @@ static u32 vega20_ih_get_wptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>> +    ih->overflow = true;
>>>>>>>
>>>>>>>    out:
>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>> @@ -462,6 +463,7 @@ static void vega20_ih_irq_rearm(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>    static void vega20_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>    {
>>>>>>> +    u32 tmp;
>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>
>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>> @@ -478,6 +480,16 @@ static void vega20_ih_set_rptr(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>        }
>>>>>>> +
>>>>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
>>>>>>> bit),
>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>> +     */
>>>>>>> +    if (ih->overflow) {
>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
>>>>>>> 0);
>>>>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>> +        ih->overflow = false;
>>>>>>> +    }
>>>>>>>    }
>>>>>>>
>>>>>>>    /**
>>>>>>> --
>>>>>>> 2.43.0
>>>>>>>
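
The two readings of WPTR_OVERFLOW_CLEAR in this thread (a write-only
trigger, as the documentation is said to state, versus a stored bit that
keeps the overflow flag from being set while it is 1, as the
observations suggest) can be compared with a small model. This is a
sketch under stated assumptions only: nothing here is taken from
hardware documentation, and enum clear_model, struct ih_hw and the
helper functions are hypothetical names made up for illustration.

```c
#include <stdbool.h>

/* Two hypothetical behaviours of the OVERFLOW_CLEAR field. */
enum clear_model { CLEAR_IS_TRIGGER, CLEAR_IS_STICKY };

struct ih_hw {
    enum clear_model model;
    bool clear_bit;      /* stored field value, only used by the sticky model */
    bool overflow_flag;  /* overflow flag as the driver would read it from WPTR */
};

/* Driver writes the OVERFLOW_CLEAR field with the given value. */
static void hw_write_clear(struct ih_hw *hw, bool value)
{
    if (value)
        hw->overflow_flag = false;   /* writing 1 clears the flag in both models */
    if (hw->model == CLEAR_IS_STICKY)
        hw->clear_bit = value;       /* sticky model remembers what was written */
}

/* The ring runs out of space once. */
static void hw_ring_overflows(struct ih_hw *hw)
{
    if (hw->model == CLEAR_IS_STICKY && hw->clear_bit)
        return;                      /* flag held clear: overflow goes unreported */
    hw->overflow_flag = true;
}

/* Counts how many of n overflows the driver notices. reset_after mimics
 * the patch: write 0 to the field again after handling an overflow. */
static int count_detected(enum clear_model model, bool reset_after, int n)
{
    struct ih_hw hw = { .model = model };
    int detected = 0;

    for (int i = 0; i < n; i++) {
        hw_ring_overflows(&hw);
        if (hw.overflow_flag) {
            detected++;
            hw_write_clear(&hw, true);       /* as done in *_ih_get_wptr() */
            if (reset_after)
                hw_write_clear(&hw, false);  /* as done in the patched *_ih_set_rptr() */
        }
    }
    return detected;
}
```

Under the trigger model the patch makes no difference to how many
overflows are seen. Under the sticky model only the first overflow is
seen without the patch, which would match the single "IH ring buffer
overflow" message reported above.
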