[PATCH 1/2] drm/amdgpu: Reset GPU on S0ix when device supports BOCO

Thu Mar 30 03:36:15 UTC 2023

On Wed, Mar 29, 2023 at 9:23 PM Mario Limonciello
<mario.limonciello at amd.com> wrote:
>
>
> On 3/29/23 04:59, Kai-Heng Feng wrote:
> > When the power is lost due to ACPI power resources being turned off, the
> > driver should reset the GPU so it can work anew.
> >
> > First, _PR3 support of the hierarchy needs to be found correctly. Since
> > the GPU on some discrete GFX cards is behind a PCIe switch, checking the
> > _PR3 on downstream port alone is not enough, as the _PR3 can associate
> > to the root port above the PCIe switch.
>
> I think this should be split into two commits:
>
> * One of them to look at _PR3 further up in hierarchy to fix indication
> for BOCO support.

Yes, this part can be split up.

>
> * One to adjust policy for whether to reset

IIUC, the GPU only needs to be reset when the power status isn't certain?

Assuming power resources in _PR3 are really disabled, GPU is already
reset by itself. That means reset shouldn't be necessary for D3cold,
am I understanding it correctly?

However, this is a desktop plugged with GFX card that has external
power, does that assumption still stand? Perform resetting on D3cold
can cover this scenario.

>
>
> > Once the _PR3 is found and BOCO support is correctly marked, use that
> > information to inform the GPU should be reset. This solves an issue that
> > system freeze on a Intel ADL desktop that uses S0ix for sleep and D3cold
> > is supported for the GFX slot.
>
> I'm worried this is still papering over an underlying issue with L0s
> handling on ALD + Navi1x/Navi2x.

Is it possible to get the ASIC's ASPM parameter under Windows? Knowing
the difference can be useful.

>
> Also, what about runtime suspend?  If you unplug the monitor from this
> dGPU and interact with it over SSH it should go into runtime suspend.
>
> Is it working properly for that case now?

Thanks for the tip. Runtime resume doesn't work at all:
[ 1087.601631] pcieport 0000:00:01.0: power state changed by ACPI to D0
[ 1087.613820] pcieport 0000:00:01.0: restoring config space at offset
0x2c (was 0x43, writing 0x43)
[ 1087.613835] pcieport 0000:00:01.0: restoring config space at offset
0x28 (was 0x41, writing 0x41)
[ 1087.613841] pcieport 0000:00:01.0: restoring config space at offset
0x24 (was 0xfff10001, writing 0xfff10001)
[ 1087.613978] pcieport 0000:00:01.0: PME# disabled
[ 1087.613984] pcieport 0000:00:01.0: waiting 100 ms for downstream
link, after activation
[ 1089.330956] pcieport 0000:01:00.0: not ready 1023ms after resume; giving up
[ 1089.373036] pcieport 0000:01:00.0: Unable to change power state
from D3cold to D0, device inaccessible

After a short while the whole system froze.

So the upstream port of GFX's PCIe switch cannot be powered on again.

Kai-Heng

>
> >
> > Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1885
> > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2458
> > Signed-off-by: Kai-Heng Feng <kai.heng.feng at canonical.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c   |  3 +++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  7 ++++++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 12 +++++-------
> >   3 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > index 60b1857f469e..407456ac0e84 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> > @@ -987,6 +987,9 @@ bool amdgpu_acpi_should_gpu_reset(struct amdgpu_device *adev)
> >       if (amdgpu_sriov_vf(adev))
> >               return false;
> >
> > +     if (amdgpu_device_supports_boco(adev_to_drm(adev)))
> > +             return true;
> > +
> >   #if IS_ENABLED(CONFIG_SUSPEND)
> >       return pm_suspend_target_state != PM_SUSPEND_TO_IDLE;
> >   #else
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index f5658359ff5c..d56b7a2bafa6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -2181,7 +2181,12 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev)
> >
> >       if (!(adev->flags & AMD_IS_APU)) {
> >               parent = pci_upstream_bridge(adev->pdev);
> > -             adev->has_pr3 = parent ? pci_pr3_present(parent) : false;
> > +             do {
> > +                     if (pci_pr3_present(parent)) {
> > +                             adev->has_pr3 = true;
> > +                             break;
> > +                     }
> > +             } while ((parent = pci_upstream_bridge(parent)));
> >       }
> >
> >       amdgpu_amdkfd_device_probe(adev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index ba5def374368..5d81fcac4b0a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -2415,10 +2415,11 @@ static int amdgpu_pmops_suspend(struct device *dev)
> >       struct drm_device *drm_dev = dev_get_drvdata(dev);
> >       struct amdgpu_device *adev = drm_to_adev(drm_dev);
> >
> > -     if (amdgpu_acpi_is_s0ix_active(adev))
> > -             adev->in_s0ix = true;
> > -     else if (amdgpu_acpi_is_s3_active(adev))
> > +     if (amdgpu_acpi_is_s3_active(adev) ||
> > +         amdgpu_device_supports_boco(drm_dev))
> >               adev->in_s3 = true;
> > +     else if (amdgpu_acpi_is_s0ix_active(adev))
> > +             adev->in_s0ix = true;
> >       if (!adev->in_s0ix && !adev->in_s3)
> >               return 0;
> >       return amdgpu_device_suspend(drm_dev, true);
> > @@ -2449,10 +2450,7 @@ static int amdgpu_pmops_resume(struct device *dev)
> >               adev->no_hw_access = true;
> >
> >       r = amdgpu_device_resume(drm_dev, true);
> > -     if (amdgpu_acpi_is_s0ix_active(adev))
> > -             adev->in_s0ix = false;
> > -     else
> > -             adev->in_s3 = false;
> > +     adev->in_s0ix = adev->in_s3 = false;
> >       return r;
> >   }
> >