[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Mon Jan 22 10:21:24 UTC 2024

On 22.01.24 11:10, Christian König wrote:
> Am 19.01.24 um 20:18 schrieb Felix Kuehling:
>> On 2024-01-18 07:07, Christian König wrote:
>>> Am 18.01.24 um 00:44 schrieb Friedrich Vock:
>>>> On 18.01.24 00:00, Alex Deucher wrote:
>>>>> [SNIP]
>>>>>>>>> Right now, IH overflows, even if they occur repeatedly, only get
>>>>>>>>> registered once. If not registering IH overflows can trivially
>>>>>>>>> lead to
>>>>>>>>> system crashes, it's amdgpu's current handling that is broken.
>>>>>>>> It's years that we last tested this but according to the HW
>>>>>>>> documentation this should work fine.
>>>>>>>>
>>>>>>>> What could potentially happen is that the IH has silenced the
>>>>>>>> source
>>>>>>>> of the overflow. We never implemented resetting those, but in this
>>>>>>>> case that here won't help either.
>>>>>>>>
>>>>>>> If the IH silenced the page faults (which quite clearly cause the
>>>>>>> overflow here), then how are the page faults still logged in dmesg?
>>>>>> There should be a hardware rate limit for the page faults, e.g.
>>>>>> there
>>>>>> can only be X faults reported in N clock cycles and then a delay is
>>>>>> inserted.
>>>>> @Christian Koenig  Is that tied to xnack (i.e., noretry)? The default
>>>>> is noretry=1 on gfx10.3 and newer.  But it can be overridden. It was
>>>>> not set on some older kernels, maybe that is the problem? @Friedrich
>>>>> Vock does setting amdgpu.noretry=1 fix the issue?
>>>>
>>>>
>>>> No, amdgpu.noretry=1 does not change anything.
>>>
>>> Well the good news first the hw engineer answered rather quickly.
>>> The bad news is that the hardware really doesn't work as documented
>>> in multiple ways.
>>>
>>> First of all the CLEAR bit is a level and not a trigger, so the
>>> intention to clear it is indeed correct. For now please modify this
>>> patch so that the CLEAR bit is set and cleared directly after
>>> setting it, this way we should be able to detect further overflows
>>> immediately.
>>>
>>> Then the APU the Steam Deck uses simply doesn't have the filter
>>> function for page faults in the hardware, the really bad news is it
>>> also doesn't have the extra IH rings where we could re-route the
>>> faults to prevent overflows.
>>>
>>> That full explains the behavior you have been seeing, but doesn't
>>> really provide a doable solution to mitigate this problem.
>>>
>>> I'm going to dig deeper into the hw documentation and specification
>>> to see if we can use a different feature to avoid the overflow.
>>
>> If we're not enabling retry faults, then each wave front should
>> generate at most one fault. You should be able to avoid overflows by
>> making the IH ring large enough to accommodate one fault per wave front.
>
> That is the exact same argument our HW engineers came up with when we
> asked why the APU is missing all those nice IH ring overflow avoidance
> features the dGPUs have :)
>
I can reproduce IH overflows on my RX 6700 XT dGPU as well FWIW.

> The only problem with this approach is that on Navi when a wave is
> blocked by waiting on a fault you can't kill it using soft recovery
> any more (at least when my understanding is correct).
>
Killing page-faulted waves via soft recovery works. From my testing on
Deck, it seems to take a bit of time, but if you try for long enough
soft recovery eventually succeeds.

Regards,
Friedrich

>>
>> If the faults are coming from SDMA, that may be another problem. I'm
>> not sure whether it can generate multiple no-retry faults from the
>> same queue.
>
> Regarding faults the SDMA is relatively harmless compared to the 3D
> engine, IIRC the resolve queue is something like 128 entries deep. So
> you never see more than those 128 faults if I'm not completely mistaken.
>
> Sunil is setting up a test system for this in an AMD lab and will play
> around with a few HW features to mitigate the issue. I still hope that
> we can completely avoid the overflow altogether.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Thanks,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>> Friedrich
>>>>
>>>>> Alex
>>>>>
>>>>>>>>> The possibility of a repeated IH overflow in between reading
>>>>>>>>> the wptr
>>>>>>>>> and updating the rptr is a good point, but how can we detect
>>>>>>>>> that at
>>>>>>>>> all? It seems to me like we can't set the OVERFLOW_CLEAR bit
>>>>>>>>> at all
>>>>>>>>> then, because we're guaranteed to miss any overflows that
>>>>>>>>> happen while
>>>>>>>>> the bit is set.
>>>>>>>> When an IH overflow is signaled we clear that flag by writing 1
>>>>>>>> into
>>>>>>>> the OVERFLOW_CLEAR bit and skip one entry in the IH ring buffer.
>>>>>>>>
>>>>>>>> What can of course happen is that the IH ring buffer overflows
>>>>>>>> more
>>>>>>>> than this single entry and we process IVs which are potentially
>>>>>>>> corrupted, but we won't miss any additional overflows since we
>>>>>>>> only
>>>>>>>> start processing after resetting the flag.
>>>>>>>>
>>>>>>>> An IH overflow is also something you should *never* see in a
>>>>>>>> production system. This is purely for driver bringup and as
>>>>>>>> fallback
>>>>>>>> when there is a severe incorrect programming of the HW.
>>>>>>>>
>>>>>>>> The only exception of that is page fault handling on MI products
>>>>>>>> because of a hardware bug, to mitigate this we are processing page
>>>>>>>> faults on a separate IH ring on those parts.
>>>>>>>>
>>>>>>>> On all other hw generations the IH should have some rate limit
>>>>>>>> for the
>>>>>>>> number of faults generated per second, so that the CPU is
>>>>>>>> always able
>>>>>>>> to catch up.
>>>>>>> I'm wondering if there is another bug in here somewhere. Your
>>>>>>> explanation of how it's supposed to work makes a lot of sense,
>>>>>>> but from
>>>>>>> what I can tell it doesn't work that way when I test it.
>>>>>>>
>>>>>>>  From the printk_ratelimit stats it would seem like >2000 faults
>>>>>>> arrive
>>>>>>> in less than a second, so perhaps your theory about fault interrupt
>>>>>>> ratelimiting not working is correct (but it's hard for me to
>>>>>>> verify what
>>>>>>> is going on without the documentation).
>>>>>> I'm going to ping the relevant engineer and putting someone on
>>>>>> the task
>>>>>> to take a look.
>>>>>>
>>>>>> Thanks,
>>>>>> Christian.
>>>>>>
>>>>>>> Regards,
>>>>>>> Friedrich
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Friedrich
>>>>>>>>>
>>>>>>>>>> When you clear the overflow again when updating the RPTR you
>>>>>>>>>> could
>>>>>>>>>> loose another overflow which might have happened in between
>>>>>>>>>> and so
>>>>>>>>>> potentially process corrupted IVs.
>>>>>>>>>>
>>>>>>>>>> That can trivially crash the system.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    static int cik_ih_early_init(void *handle)
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>> index b8c47e0cf37a..076559668573 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
>>>>>>>>>>> @@ -215,7 +215,7 @@ static u32 cz_ih_get_wptr(struct
>>>>>>>>>>> amdgpu_device
>>>>>>>>>>> *adev,
>>>>>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>> -
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>> @@ -266,7 +266,19 @@ static void cz_ih_decode_iv(struct
>>>>>>>>>>> amdgpu_device
>>>>>>>>>>> *adev,
>>>>>>>>>>>    static void cz_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                   struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>> +
>>>>>>>>>>>        WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    static int cz_ih_early_init(void *handle)
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>> index aecad530b10a..1a5e668643d1 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
>>>>>>>>>>> @@ -214,7 +214,7 @@ static u32 iceland_ih_get_wptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>> -
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>> @@ -265,7 +265,19 @@ static void iceland_ih_decode_iv(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>    static void iceland_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                    struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>> +
>>>>>>>>>>>        WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    static int iceland_ih_early_init(void *handle)
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>> index d9ed7332d805..ce8f7feec713 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
>>>>>>>>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_0_get_wptr(struct
>>>>>>>>>>> amdgpu_device
>>>>>>>>>>> *adev,
>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>> +
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>    }
>>>>>>>>>>> @@ -459,6 +461,7 @@ static void ih_v6_0_irq_rearm(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>    static void ih_v6_0_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>
>>>>>>>>>>>        if (ih->use_doorbell) {
>>>>>>>>>>> @@ -472,6 +475,16 @@ static void ih_v6_0_set_rptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>        }
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    /**
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>> index 8fb05eae340a..668788ad34d9 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
>>>>>>>>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_1_get_wptr(struct
>>>>>>>>>>> amdgpu_device
>>>>>>>>>>> *adev,
>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>> +
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>    }
>>>>>>>>>>> @@ -459,6 +461,7 @@ static void ih_v6_1_irq_rearm(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>    static void ih_v6_1_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>
>>>>>>>>>>>        if (ih->use_doorbell) {
>>>>>>>>>>> @@ -472,6 +475,16 @@ static void ih_v6_1_set_rptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>        }
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    /**
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>> index e64b33115848..0bdac923cb4d 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
>>>>>>>>>>> @@ -442,6 +442,7 @@ static u32 navi10_ih_get_wptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>    }
>>>>>>>>>>> @@ -483,6 +484,7 @@ static void navi10_ih_irq_rearm(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>    static void navi10_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>
>>>>>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>>>>>> @@ -499,6 +501,16 @@ static void navi10_ih_set_rptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>        }
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    /**
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>> index 9a24f17a5750..ff35056d2b54 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
>>>>>>>>>>> @@ -119,6 +119,7 @@ static u32 si_ih_get_wptr(struct
>>>>>>>>>>> amdgpu_device
>>>>>>>>>>> *adev,
>>>>>>>>>>>            tmp = RREG32(IH_RB_CNTL);
>>>>>>>>>>>            tmp |= IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
>>>>>>>>>>>            WREG32(IH_RB_CNTL, tmp);
>>>>>>>>>>> +        ih->overflow = true;
>>>>>>>>>>>        }
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>>    }
>>>>>>>>>>> @@ -147,7 +148,18 @@ static void si_ih_decode_iv(struct
>>>>>>>>>>> amdgpu_device
>>>>>>>>>>> *adev,
>>>>>>>>>>>    static void si_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                   struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>> +
>>>>>>>>>>>        WREG32(IH_RB_RPTR, ih->rptr);
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32(IH_RB_CNTL);
>>>>>>>>>>> +        tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
>>>>>>>>>>> +        WREG32(IH_RB_CNTL, tmp);
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    static int si_ih_early_init(void *handle)
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>> index 917707bba7f3..6f5090d3db48 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
>>>>>>>>>>> @@ -218,6 +218,7 @@ static u32 tonga_ih_get_wptr(struct
>>>>>>>>>>> amdgpu_device
>>>>>>>>>>> *adev,
>>>>>>>>>>>        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>> @@ -268,6 +269,8 @@ static void tonga_ih_decode_iv(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>    static void tonga_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                      struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>> +
>>>>>>>>>>>        if (ih->use_doorbell) {
>>>>>>>>>>>            /* XXX check if swapping is necessary on BE */
>>>>>>>>>>>            *ih->rptr_cpu = ih->rptr;
>>>>>>>>>>> @@ -275,6 +278,16 @@ static void tonga_ih_set_rptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>        } else {
>>>>>>>>>>>            WREG32(mmIH_RB_RPTR, ih->rptr);
>>>>>>>>>>>        }
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32(mmIH_RB_CNTL);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> +        WREG32(mmIH_RB_CNTL, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    static int tonga_ih_early_init(void *handle)
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>> index d364c6dd152c..bb005924f194 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>>>>>>>> @@ -372,6 +372,7 @@ static u32 vega10_ih_get_wptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>> @@ -413,6 +414,7 @@ static void vega10_ih_irq_rearm(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>    static void vega10_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>
>>>>>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>>>>>> @@ -429,6 +431,16 @@ static void vega10_ih_set_rptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>        }
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    /**
>>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>> index ddfc6941f9d5..bb725a970697 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>>>>>>>>>>> @@ -420,6 +420,7 @@ static u32 vega20_ih_get_wptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>        tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
>>>>>>>>>>>        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR, 1);
>>>>>>>>>>>        WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
>>>>>>>>>>> +    ih->overflow = true;
>>>>>>>>>>>
>>>>>>>>>>>    out:
>>>>>>>>>>>        return (wptr & ih->ptr_mask);
>>>>>>>>>>> @@ -462,6 +463,7 @@ static void vega20_ih_irq_rearm(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>    static void vega20_ih_set_rptr(struct amdgpu_device *adev,
>>>>>>>>>>>                       struct amdgpu_ih_ring *ih)
>>>>>>>>>>>    {
>>>>>>>>>>> +    u32 tmp;
>>>>>>>>>>>        struct amdgpu_ih_regs *ih_regs;
>>>>>>>>>>>
>>>>>>>>>>>        if (ih == &adev->irq.ih_soft)
>>>>>>>>>>> @@ -478,6 +480,16 @@ static void vega20_ih_set_rptr(struct
>>>>>>>>>>> amdgpu_device *adev,
>>>>>>>>>>>            ih_regs = &ih->ih_regs;
>>>>>>>>>>>            WREG32(ih_regs->ih_rb_rptr, ih->rptr);
>>>>>>>>>>>        }
>>>>>>>>>>> +
>>>>>>>>>>> +    /* If we overflowed previously (and thus set the
>>>>>>>>>>> OVERFLOW_CLEAR
>>>>>>>>>>> bit),
>>>>>>>>>>> +     * reset it here to detect more overflows if they occur.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    if (ih->overflow) {
>>>>>>>>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
>>>>>>>>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL,
>>>>>>>>>>> WPTR_OVERFLOW_CLEAR,
>>>>>>>>>>> 0);
>>>>>>>>>>> + WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
>>>>>>>>>>> +        ih->overflow = false;
>>>>>>>>>>> +    }
>>>>>>>>>>>    }
>>>>>>>>>>>
>>>>>>>>>>>    /**
>>>>>>>>>>> --
>>>>>>>>>>> 2.43.0
>>>>>>>>>>>
>>>
>