[PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

Wed Jan 17 23:00:20 UTC 2024

On Wed, Jan 17, 2024 at 7:36 AM Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
>
> Am 16.01.24 um 11:31 schrieb Friedrich Vock:
> > On 16.01.24 08:03, Christian König wrote:
> >> Am 15.01.24 um 12:18 schrieb Friedrich Vock:
> >>> [SNIP]
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32(mmIH_RB_CNTL);
> >>>>> +        tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
> >>>>> +        WREG32(mmIH_RB_CNTL, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>
> >>>> Well that is an extremely bad idea. We already reset the overflow
> >>>> after reading the WPTR.
> >>>
> >>> This is not resetting the overflow bit. This is resetting a "clear
> >>> overflow" bit. I don't have the hardware docs, but the name (and my
> >>> observations) strongly suggest that setting this bit actually prevents
> >>> the hardware from setting the overflow bit ever again.
> >>
> >> Well that doesn't make any sense at all. The hardware documentation
> >> clearly states that this bit is write only and should always read as
> >> zero.
> >>
> >> Setting this bit will clear the overflow flag in the WPTR register and
> >> clearing it has no effect at all.
> >>
> >> I could only ping the hw engineer responsible for this block to double
> >> check if the documentation is somehow outdated, but I really doubt so.
> >>
> > I see. I wish I had access to the documentation,
>
> Well, doesn't Valve has an NDA in place?
>
> > but I don't, so all I
> > can do is tell you what I observe the hardware doing. I've tested this
> > on both a Steam Deck (OSSYS 5.2.0) and an RX 6700 XT (OSSYS 5.0.3). On
> > both systems, launching a bunch of shaders that cause page faults leads
> > to lots of "[gfxhub] page fault" messages in dmesg, followed by an
> > "amdgpu: IH ring buffer overflow".
>
> Well that is certainly a bug, maybe even the same thing we have seen on
> Vega and MI.
>
> What we could do is to try to apply the same workaround to re-route the
> page faults to a different IH ring.
>
> See those patches here as well:
>
> commit 516bc3d8dd7965f1a8a3ea453857f14d95971e62
> Author: Christian König <christian.koenig at amd.com>
> Date:   Fri Nov 2 15:00:16 2018 +0100
>
>      drm/amdgpu: reroute VMC and UMD to IH ring 1
>
>      Page faults can easily overwhelm the interrupt handler.
>
>      So to make sure that we never lose valuable interrupts on the
> primary ring
>      we re-route page faults to IH ring 1.
>
> commit b849aaa41c914a0fd88003f88cb04420a873c624
> Author: Christian König <christian.koenig at amd.com>
> Date:   Mon Mar 4 19:34:34 2019 +0100
>
>      drm/amdgpu: also reroute VMC and UMD to IH ring 1 on Vega 20
>
>      Same patch we alredy did for Vega10. Just re-route page faults to a
> separate
>      ring to avoid drowning in interrupts.
>
> >
> > If I re-launch the same set of shaders after the GPU has soft-recovered,
> > the "amdgpu: IH ring buffer overflow" message is missing, even though
> > the same amount of page faults should've been triggered at roughly the
> > same rate. Running with this patch applied makes more "amdgpu: IH ring
> > buffer overflow" messages appear after relaunching the faulting shaders
> > (but not when processing any non-faulting work).
>
> That is actually the expected behavior. There should be a limit on the
> number of faults written to the ring so that the ring never overflows.
>
> >
> > The only possible conclusion I can draw from this is that clearing that
> > bit *does* have an effect, and I don't think it's far-fetched to assume
> > the IH ring buffer overflows still happen after re-launching the
> > faulting shaders but go undetected so far.
>
> Well that can only mean that the hw documentation is incorrect.
>
> Either the value is not write only trigger bit as documented or we need
> an additional read of the register for it to take effect or something
> like this.
>
> >>> Right now, IH overflows, even if they occur repeatedly, only get
> >>> registered once. If not registering IH overflows can trivially lead to
> >>> system crashes, it's amdgpu's current handling that is broken.
> >>
> >> It's years that we last tested this but according to the HW
> >> documentation this should work fine.
> >>
> >> What could potentially happen is that the IH has silenced the source
> >> of the overflow. We never implemented resetting those, but in this
> >> case that here won't help either.
> >>
> > If the IH silenced the page faults (which quite clearly cause the
> > overflow here), then how are the page faults still logged in dmesg?
>
> There should be a hardware rate limit for the page faults, e.g. there
> can only be X faults reported in N clock cycles and then a delay is
> inserted.

@Christian Koenig  Is that tied to xnack (i.e., noretry)?  The default
is noretry=1 on gfx10.3 and newer.  But it can be overridden.  It was
not set on some older kernels, maybe that is the problem?  @Friedrich
Vock does setting amdgpu.noretry=1 fix the issue?

Alex

>
> >>>
> >>> The possibility of a repeated IH overflow in between reading the wptr
> >>> and updating the rptr is a good point, but how can we detect that at
> >>> all? It seems to me like we can't set the OVERFLOW_CLEAR bit at all
> >>> then, because we're guaranteed to miss any overflows that happen while
> >>> the bit is set.
> >>
> >> When an IH overflow is signaled we clear that flag by writing 1 into
> >> the OVERFLOW_CLEAR bit and skip one entry in the IH ring buffer.
> >>
> >> What can of course happen is that the IH ring buffer overflows more
> >> than this single entry and we process IVs which are potentially
> >> corrupted, but we won't miss any additional overflows since we only
> >> start processing after resetting the flag.
> >>
> >> An IH overflow is also something you should *never* see in a
> >> production system. This is purely for driver bringup and as fallback
> >> when there is a severe incorrect programming of the HW.
> >>
> >> The only exception of that is page fault handling on MI products
> >> because of a hardware bug, to mitigate this we are processing page
> >> faults on a separate IH ring on those parts.
> >>
> >> On all other hw generations the IH should have some rate limit for the
> >> number of faults generated per second, so that the CPU is always able
> >> to catch up.
> >
> > I'm wondering if there is another bug in here somewhere. Your
> > explanation of how it's supposed to work makes a lot of sense, but from
> > what I can tell it doesn't work that way when I test it.
> >
> > From the printk_ratelimit stats it would seem like >2000 faults arrive
> > in less than a second, so perhaps your theory about fault interrupt
> > ratelimiting not working is correct (but it's hard for me to verify what
> > is going on without the documentation).
>
> I'm going to ping the relevant engineer and putting someone on the task
> to take a look.
>
> Thanks,
> Christian.
>
> >
> > Regards,
> > Friedrich
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Regards,
> >>> Friedrich
> >>>
> >>>>
> >>>> When you clear the overflow again when updating the RPTR you could
> >>>> loose another overflow which might have happened in between and so
> >>>> potentially process corrupted IVs.
> >>>>
> >>>> That can trivially crash the system.
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>>   }
> >>>>>
> >>>>>   static int cik_ih_early_init(void *handle)
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> >>>>> index b8c47e0cf37a..076559668573 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
> >>>>> @@ -215,7 +215,7 @@ static u32 cz_ih_get_wptr(struct amdgpu_device
> >>>>> *adev,
> >>>>>       tmp = RREG32(mmIH_RB_CNTL);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32(mmIH_RB_CNTL, tmp);
> >>>>> -
> >>>>> +    ih->overflow = true;
> >>>>>
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>> @@ -266,7 +266,19 @@ static void cz_ih_decode_iv(struct amdgpu_device
> >>>>> *adev,
> >>>>>   static void cz_ih_set_rptr(struct amdgpu_device *adev,
> >>>>>                  struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>> +
> >>>>>       WREG32(mmIH_RB_RPTR, ih->rptr);
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32(mmIH_RB_CNTL);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32(mmIH_RB_CNTL, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   static int cz_ih_early_init(void *handle)
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> >>>>> index aecad530b10a..1a5e668643d1 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/iceland_ih.c
> >>>>> @@ -214,7 +214,7 @@ static u32 iceland_ih_get_wptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>       tmp = RREG32(mmIH_RB_CNTL);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32(mmIH_RB_CNTL, tmp);
> >>>>> -
> >>>>> +    ih->overflow = true;
> >>>>>
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>> @@ -265,7 +265,19 @@ static void iceland_ih_decode_iv(struct
> >>>>> amdgpu_device *adev,
> >>>>>   static void iceland_ih_set_rptr(struct amdgpu_device *adev,
> >>>>>                   struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>> +
> >>>>>       WREG32(mmIH_RB_RPTR, ih->rptr);
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32(mmIH_RB_CNTL);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32(mmIH_RB_CNTL, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   static int iceland_ih_early_init(void *handle)
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
> >>>>> index d9ed7332d805..ce8f7feec713 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_0.c
> >>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_0_get_wptr(struct amdgpu_device
> >>>>> *adev,
> >>>>>       tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
> >>>>> +    ih->overflow = true;
> >>>>> +
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>>   }
> >>>>> @@ -459,6 +461,7 @@ static void ih_v6_0_irq_rearm(struct
> >>>>> amdgpu_device *adev,
> >>>>>   static void ih_v6_0_set_rptr(struct amdgpu_device *adev,
> >>>>>                      struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>>       struct amdgpu_ih_regs *ih_regs;
> >>>>>
> >>>>>       if (ih->use_doorbell) {
> >>>>> @@ -472,6 +475,16 @@ static void ih_v6_0_set_rptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>           ih_regs = &ih->ih_regs;
> >>>>>           WREG32(ih_regs->ih_rb_rptr, ih->rptr);
> >>>>>       }
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   /**
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
> >>>>> index 8fb05eae340a..668788ad34d9 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/ih_v6_1.c
> >>>>> @@ -418,6 +418,8 @@ static u32 ih_v6_1_get_wptr(struct amdgpu_device
> >>>>> *adev,
> >>>>>       tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
> >>>>> +    ih->overflow = true;
> >>>>> +
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>>   }
> >>>>> @@ -459,6 +461,7 @@ static void ih_v6_1_irq_rearm(struct
> >>>>> amdgpu_device *adev,
> >>>>>   static void ih_v6_1_set_rptr(struct amdgpu_device *adev,
> >>>>>                      struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>>       struct amdgpu_ih_regs *ih_regs;
> >>>>>
> >>>>>       if (ih->use_doorbell) {
> >>>>> @@ -472,6 +475,16 @@ static void ih_v6_1_set_rptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>           ih_regs = &ih->ih_regs;
> >>>>>           WREG32(ih_regs->ih_rb_rptr, ih->rptr);
> >>>>>       }
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   /**
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> >>>>> index e64b33115848..0bdac923cb4d 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/navi10_ih.c
> >>>>> @@ -442,6 +442,7 @@ static u32 navi10_ih_get_wptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>       tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
> >>>>> +    ih->overflow = true;
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>>   }
> >>>>> @@ -483,6 +484,7 @@ static void navi10_ih_irq_rearm(struct
> >>>>> amdgpu_device *adev,
> >>>>>   static void navi10_ih_set_rptr(struct amdgpu_device *adev,
> >>>>>                      struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>>       struct amdgpu_ih_regs *ih_regs;
> >>>>>
> >>>>>       if (ih == &adev->irq.ih_soft)
> >>>>> @@ -499,6 +501,16 @@ static void navi10_ih_set_rptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>           ih_regs = &ih->ih_regs;
> >>>>>           WREG32(ih_regs->ih_rb_rptr, ih->rptr);
> >>>>>       }
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   /**
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/si_ih.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/si_ih.c
> >>>>> index 9a24f17a5750..ff35056d2b54 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/si_ih.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/si_ih.c
> >>>>> @@ -119,6 +119,7 @@ static u32 si_ih_get_wptr(struct amdgpu_device
> >>>>> *adev,
> >>>>>           tmp = RREG32(IH_RB_CNTL);
> >>>>>           tmp |= IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
> >>>>>           WREG32(IH_RB_CNTL, tmp);
> >>>>> +        ih->overflow = true;
> >>>>>       }
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>>   }
> >>>>> @@ -147,7 +148,18 @@ static void si_ih_decode_iv(struct amdgpu_device
> >>>>> *adev,
> >>>>>   static void si_ih_set_rptr(struct amdgpu_device *adev,
> >>>>>                  struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>> +
> >>>>>       WREG32(IH_RB_RPTR, ih->rptr);
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32(IH_RB_CNTL);
> >>>>> +        tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
> >>>>> +        WREG32(IH_RB_CNTL, tmp);
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   static int si_ih_early_init(void *handle)
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> >>>>> index 917707bba7f3..6f5090d3db48 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/tonga_ih.c
> >>>>> @@ -218,6 +218,7 @@ static u32 tonga_ih_get_wptr(struct amdgpu_device
> >>>>> *adev,
> >>>>>       tmp = RREG32(mmIH_RB_CNTL);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32(mmIH_RB_CNTL, tmp);
> >>>>> +    ih->overflow = true;
> >>>>>
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>> @@ -268,6 +269,8 @@ static void tonga_ih_decode_iv(struct
> >>>>> amdgpu_device *adev,
> >>>>>   static void tonga_ih_set_rptr(struct amdgpu_device *adev,
> >>>>>                     struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>> +
> >>>>>       if (ih->use_doorbell) {
> >>>>>           /* XXX check if swapping is necessary on BE */
> >>>>>           *ih->rptr_cpu = ih->rptr;
> >>>>> @@ -275,6 +278,16 @@ static void tonga_ih_set_rptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>       } else {
> >>>>>           WREG32(mmIH_RB_RPTR, ih->rptr);
> >>>>>       }
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32(mmIH_RB_CNTL);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32(mmIH_RB_CNTL, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   static int tonga_ih_early_init(void *handle)
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> >>>>> index d364c6dd152c..bb005924f194 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
> >>>>> @@ -372,6 +372,7 @@ static u32 vega10_ih_get_wptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>       tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
> >>>>> +    ih->overflow = true;
> >>>>>
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>> @@ -413,6 +414,7 @@ static void vega10_ih_irq_rearm(struct
> >>>>> amdgpu_device *adev,
> >>>>>   static void vega10_ih_set_rptr(struct amdgpu_device *adev,
> >>>>>                      struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>>       struct amdgpu_ih_regs *ih_regs;
> >>>>>
> >>>>>       if (ih == &adev->irq.ih_soft)
> >>>>> @@ -429,6 +431,16 @@ static void vega10_ih_set_rptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>           ih_regs = &ih->ih_regs;
> >>>>>           WREG32(ih_regs->ih_rb_rptr, ih->rptr);
> >>>>>       }
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   /**
> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> >>>>> b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> >>>>> index ddfc6941f9d5..bb725a970697 100644
> >>>>> --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
> >>>>> @@ -420,6 +420,7 @@ static u32 vega20_ih_get_wptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>       tmp = RREG32_NO_KIQ(ih_regs->ih_rb_cntl);
> >>>>>       tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
> >>>>>       WREG32_NO_KIQ(ih_regs->ih_rb_cntl, tmp);
> >>>>> +    ih->overflow = true;
> >>>>>
> >>>>>   out:
> >>>>>       return (wptr & ih->ptr_mask);
> >>>>> @@ -462,6 +463,7 @@ static void vega20_ih_irq_rearm(struct
> >>>>> amdgpu_device *adev,
> >>>>>   static void vega20_ih_set_rptr(struct amdgpu_device *adev,
> >>>>>                      struct amdgpu_ih_ring *ih)
> >>>>>   {
> >>>>> +    u32 tmp;
> >>>>>       struct amdgpu_ih_regs *ih_regs;
> >>>>>
> >>>>>       if (ih == &adev->irq.ih_soft)
> >>>>> @@ -478,6 +480,16 @@ static void vega20_ih_set_rptr(struct
> >>>>> amdgpu_device *adev,
> >>>>>           ih_regs = &ih->ih_regs;
> >>>>>           WREG32(ih_regs->ih_rb_rptr, ih->rptr);
> >>>>>       }
> >>>>> +
> >>>>> +    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
> >>>>> bit),
> >>>>> +     * reset it here to detect more overflows if they occur.
> >>>>> +     */
> >>>>> +    if (ih->overflow) {
> >>>>> +        tmp = RREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl);
> >>>>> +        tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR,
> >>>>> 0);
> >>>>> +        WREG32_NO_KIQ(ih->ih_regs.ih_rb_cntl, tmp);
> >>>>> +        ih->overflow = false;
> >>>>> +    }
> >>>>>   }
> >>>>>
> >>>>>   /**
> >>>>> --
> >>>>> 2.43.0
> >>>>>
> >>>>
> >>
>