[PATCH 6/6] drm/amdgpu: fix fence fallback timer expired error

Thu May 8 06:53:34 UTC 2025

[AMD Official Use Only - AMD Internal Distribution Only]

Hi HaiJun,

Thank you for the info.
You are right: the GFXMSIX_VECT0_ADDR_LO and GFXMSIX_VECT0_CONTROL registers are not restored on resume with a new VF.

source VF, normal value
0x42000: 0xFEE00138 // GFXMSIX_VECT0_ADDR_LO
0x4200C: 0x00000000 // GFXMSIX_VECT0_CONTROL

destination VF, abnormal value
0x42000: 0x00000000 // GFXMSIX_VECT0_ADDR_LO
0x4200C: 0x00000001 // GFXMSIX_VECT0_CONTROL
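
To make the difference between the two dumps explicit, here is a minimal sketch that decodes the values. It assumes GFXMSIX_VECT0_CONTROL follows the standard PCI MSI-X Vector Control layout (bit 0 is the per-vector mask bit) and that the guest is x86, where MSI message addresses fall in the 0xFEExxxxx window. The struct and helper names are hypothetical and are not part of the amdgpu driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of an MSI-X Vector Control dword is the per-vector mask bit. */
#define MSIX_VECTOR_CTRL_MASKBIT 0x1u
/* Assumption: x86 guest, where MSI message addresses are 0xFEExxxxx. */
#define X86_MSI_ADDR_BASE 0xFEE00000u
#define X86_MSI_ADDR_MASK 0xFFF00000u

/* Hypothetical snapshot of the two registers dumped above. */
struct msix_vec0 {
	uint32_t addr_lo; /* GFXMSIX_VECT0_ADDR_LO (0x42000) */
	uint32_t ctrl;    /* GFXMSIX_VECT0_CONTROL (0x4200C) */
};

/* True if the vector is masked, i.e. the device must not send it. */
static inline bool msix_vec0_is_masked(const struct msix_vec0 *v)
{
	return (v->ctrl & MSIX_VECTOR_CTRL_MASKBIT) != 0;
}

/* True if the low address dword points into the x86 MSI window. */
static inline bool msix_vec0_addr_is_programmed(const struct msix_vec0 *v)
{
	return (v->addr_lo & X86_MSI_ADDR_MASK) == X86_MSI_ADDR_BASE;
}

/* A vector can only reach the guest if it is programmed and unmasked. */
static inline bool msix_vec0_can_deliver(const struct msix_vec0 *v)
{
	return msix_vec0_addr_is_programmed(v) && !msix_vec0_is_masked(v);
}
```

Under these assumptions the source VF values describe a programmed, unmasked vector, while the destination VF values describe an unprogrammed, masked one, which is consistent with the IH handler never being called.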

Calling amdgpu_restore_msix() on resume fixes this issue. I will send a new patch for this.

static int vega20_ih_resume(struct amdgpu_ip_block *ip_block)
{
+       struct amdgpu_device *adev = ip_block->adev;
+
+       if (amdgpu_xmgi_is_node_changed(adev))
+               amdgpu_restore_msix(adev);
        return vega20_ih_hw_init(ip_block);
}

Thanks
Sam

From: Chang, HaiJun <HaiJun.Chang at amd.com>
Date: Tuesday, April 29, 2025 at 10:43
To: Koenig, Christian <Christian.Koenig at amd.com>, Zhang, GuoQing (Sam) <GuoQing.Zhang at amd.com>, Christian König <ckoenig.leichtzumerken at gmail.com>, amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>, Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: Zhao, Victor <Victor.Zhao at amd.com>, Deng, Emily <Emily.Deng at amd.com>, Zhang, Owen(SRDC) <Owen.Zhang2 at amd.com>
Subject: RE: [PATCH 6/6] drm/amdgpu: fix fence fallback timer expired error
[AMD Official Use Only - AMD Internal Distribution Only]

Hi,

The interrupt loss issue might be caused by the VF MSI-X table not being restored properly on resume.

The MSI-X table seen inside the virtual machine is emulated. The real MSI-X table is programmed by QEMU when the guest enables or disables MSI-X interrupts. However, QEMU's access to the VF MSI-X table (the GFXMSIX_* registers) can be blocked by nBIF protection if the VF is not in exclusive access at that time.
We already have a workaround in the amdgpu driver for this MSI-X table loss case: the amdgpu_restore_msix() function.

Can you check the values of these GFXMSIX_* registers?  I think we should call amdgpu_restore_msix() on resume to restore the MSI-X table.
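
For illustration only, the restore idea can be modelled in user space as below. This is a sketch under the assumption that the workaround clears and then sets the MSI-X Enable bit in the Message Control word, so that the hypervisor traps the write and reprograms the real table from the guest's copy. All struct and function names here are hypothetical; this is not the driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/* MSI-X Enable bit in the Message Control word (same value as
 * PCI_MSIX_FLAGS_ENABLE in the Linux PCI headers). */
#define MODEL_MSIX_FLAGS_ENABLE 0x8000u

/* Hypothetical model of one VF as seen by guest and hypervisor. */
struct model_vf {
	uint16_t msix_flags;    /* guest-visible Message Control word */
	uint32_t guest_addr_lo; /* vector 0 address in the emulated table */
	uint32_t hw_addr_lo;    /* vector 0 address in the real VF table */
	uint32_t hw_ctrl;       /* vector 0 control in the real VF table */
};

/* Models the hypervisor trapping a Message Control write: on a
 * disabled -> enabled transition it programs the real table. */
static inline void model_write_msix_flags(struct model_vf *vf, uint16_t val)
{
	bool was_enabled = (vf->msix_flags & MODEL_MSIX_FLAGS_ENABLE) != 0;
	bool now_enabled = (val & MODEL_MSIX_FLAGS_ENABLE) != 0;

	vf->msix_flags = val;
	if (!was_enabled && now_enabled) {
		vf->hw_addr_lo = vf->guest_addr_lo;
		vf->hw_ctrl = 0; /* unmask vector 0 */
	}
}

/* Models the restore step: toggle Enable off and on again so the
 * hypervisor reprograms the table. Does nothing if MSI-X is off. */
static inline void model_restore_msix(struct model_vf *vf)
{
	uint16_t ctrl = vf->msix_flags;

	if (!(ctrl & MODEL_MSIX_FLAGS_ENABLE))
		return;

	model_write_msix_flags(vf, (uint16_t)(ctrl & ~MODEL_MSIX_FLAGS_ENABLE));
	model_write_msix_flags(vf, (uint16_t)(ctrl | MODEL_MSIX_FLAGS_ENABLE));
}
```

In this model, a VF whose real table was lost (address 0, vector masked) while the guest still believes MSI-X is enabled gets its table reprogrammed by one call to model_restore_msix().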

Thanks,
HaiJun

-----Original Message-----
From: Koenig, Christian <Christian.Koenig at amd.com>
Sent: Monday, April 28, 2025 8:24 PM
To: Zhang, GuoQing (Sam) <GuoQing.Zhang at amd.com>; Christian König <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>
Cc: Zhao, Victor <Victor.Zhao at amd.com>; Chang, HaiJun <HaiJun.Chang at amd.com>; Deng, Emily <Emily.Deng at amd.com>; Zhang, Owen(SRDC) <Owen.Zhang2 at amd.com>
Subject: Re: [PATCH 6/6] drm/amdgpu: fix fence fallback timer expired error

On 4/24/25 05:38, Zhang, GuoQing (Sam) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
> Ping… @Koenig, Christian <mailto:Christian.Koenig at amd.com>
>
> Thanks
>
> Sam
>
> *From: *amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> Zhang, GuoQing (Sam) <GuoQing.Zhang at amd.com>
> *Date: *Wednesday, April 23, 2025 at 14:59
> *To: *Christian König <ckoenig.leichtzumerken at gmail.com>, amd-
> gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc: *Zhao, Victor <Victor.Zhao at amd.com>, Chang, HaiJun
> <HaiJun.Chang at amd.com>, Deng, Emily <Emily.Deng at amd.com>, Zhang,
> Owen(SRDC) <Owen.Zhang2 at amd.com>
> *Subject: *Re: [PATCH 6/6] drm/amdgpu: fix fence fallback timer
> expired error
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi @Christian König <mailto:ckoenig.leichtzumerken at gmail.com>,
>
> In a QEMU VM environment, when request_irq() is called in the guest KMD,
> QEMU enables the interrupt for the device on the host.
>
> When the guest hibernates and resumes on a new vGPU without calling
> request_irq() for that vGPU, the interrupt of the new vGPU is not
> enabled, so the IH handler in the guest KMD is not called.
>
> This change ensures that request_irq() is called on resume for new vGPUs.

That doesn't make sense.

The MSI state is saved and restored by the core OS on suspend and resume; drivers should never mess with that.

If this doesn't work with the new vGPU for some reason then that is not something we can work around inside the driver.

Which state exactly isn't restored here?

Regards,
Christian.

>
> Regards
>
> Sam
>
> *From: *Christian König <ckoenig.leichtzumerken at gmail.com>
> *Date: *Wednesday, April 16, 2025 at 21:54
> *To: *Zhang, GuoQing (Sam) <GuoQing.Zhang at amd.com>, amd-
> gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
> *Cc: *Zhao, Victor <Victor.Zhao at amd.com>, Chang, HaiJun
> <HaiJun.Chang at amd.com>, Deng, Emily <Emily.Deng at amd.com>
> *Subject: *Re: [PATCH 6/6] drm/amdgpu: fix fence fallback timer
> expired error
>
> On 14.04.25 at 12:46, Samuel Zhang wrote:
>  > IH is not working after switching to a new gpu index for the first time.
>  > The IH handler function needs to be re-registered with the kernel after
>  > switching to a new gpu index.
>
> Why?
>
> Christian.
>
>  >
>  > Signed-off-by: Samuel Zhang <guoqing.zhang at amd.com>
>  > Change-Id: Idece1c8fce24032fd08f5a8b6ac23793c51e56dd
>  > ---
>  >  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c |  7 +++++--
>  >  drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h |  1 +
>  >  drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 18 ++++++++++++++++--
>  >  3 files changed, 22 insertions(+), 4 deletions(-)
>  >
>  > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>  > index 19ce4da285e8..2292245a0c5d 100644
>  > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>  > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>  > @@ -326,7 +326,7 @@ int amdgpu_irq_init(struct amdgpu_device *adev)
>  >        return r;
>  >  }
>  >
>  > -void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
>  > +void amdgpu_irq_uninstall(struct amdgpu_device *adev)
>  >  {
>  >        if (adev->irq.installed) {
>  >                free_irq(adev->irq.irq, adev_to_drm(adev));
>  > @@ -334,7 +334,10 @@ void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
>  >                if (adev->irq.msi_enabled)
>  >                        pci_free_irq_vectors(adev->pdev);
>  >        }
>  > -
>  > +}
>  > +void amdgpu_irq_fini_hw(struct amdgpu_device *adev)
>  > +{
>  > +     amdgpu_irq_uninstall(adev);
>  >        amdgpu_ih_ring_fini(adev, &adev->irq.ih_soft);
>  >        amdgpu_ih_ring_fini(adev, &adev->irq.ih);
>  >        amdgpu_ih_ring_fini(adev, &adev->irq.ih1);
>  > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
>  > index 04c0b4fa17a4..c6e6681b4f71 100644
>  > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
>  > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.h
>  > @@ -123,6 +123,7 @@ extern const int node_id_to_phys_map[NODEID_MAX];
>  >  void amdgpu_irq_disable_all(struct amdgpu_device *adev);
>  >
>  >  int amdgpu_irq_init(struct amdgpu_device *adev);
>  > +void amdgpu_irq_uninstall(struct amdgpu_device *adev);
>  >  void amdgpu_irq_fini_sw(struct amdgpu_device *adev);
>  >  void amdgpu_irq_fini_hw(struct amdgpu_device *adev);
>  >  int amdgpu_irq_add_id(struct amdgpu_device *adev,
>  > diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>  > index faa0dd75dd6d..ef996505e4dc 100644
>  > --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>  > +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
>  > @@ -643,12 +643,26 @@ static int vega20_ih_hw_fini(struct amdgpu_ip_block *ip_block)
>  >
>  >  static int vega20_ih_suspend(struct amdgpu_ip_block *ip_block)
>  >  {
>  > -     return vega20_ih_hw_fini(ip_block);
>  > +     struct amdgpu_device *adev = ip_block->adev;
>  > +     int r = 0;
>  > +
>  > +     r = vega20_ih_hw_fini(ip_block);
>  > +     amdgpu_irq_uninstall(adev);
>  > +     return r;
>  >  }
>  >
>  >  static int vega20_ih_resume(struct amdgpu_ip_block *ip_block)
>  >  {
>  > -     return vega20_ih_hw_init(ip_block);
>  > +     struct amdgpu_device *adev = ip_block->adev;
>  > +     int r = 0;
>  > +
>  > +     r = amdgpu_irq_init(adev);
>  > +     if (r) {
>  > +             dev_err(adev->dev, "amdgpu_irq_init failed in %s, %d\n", __func__, r);
>  > +             return r;
>  > +     }
>  > +     r = vega20_ih_hw_init(ip_block);
>  > +     return r;
>  >  }
>  >
>  >  static bool vega20_ih_is_idle(struct amdgpu_ip_block *ip_block)
>