[PATCH v4 1/8] drm/amdgpu: Avoid accessing HW when suspending SW state
Bjorn Helgaas
helgaas at kernel.org
Wed Sep 2 21:56:12 UTC 2020
On Wed, Sep 02, 2020 at 02:42:03PM -0400, Andrey Grodzovsky wrote:
> At this point the ASIC is already post reset by the HW/PSP
> so the HW not in proper state to be configured for suspension,
> some blocks might be even gated and so best is to avoid touching it.
>
> v2: Rename in_dpc to more meaningful name
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
> Reviewed-by: Alex Deucher <alexander.deucher at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 6 +++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 6 +++++
> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 18 ++++++++------
> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 3 +++
> 6 files changed, 65 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c311a3c..b20354f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,7 @@ struct amdgpu_device {
> atomic_t throttling_logging_enabled;
> struct ratelimit_state throttling_logging_rs;
> uint32_t ras_features;
> + bool in_pci_err_recovery;
> };
>
> static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 74a1c03..1fbf8a1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -319,6 +319,9 @@ uint32_t amdgpu_mm_rreg(struct amdgpu_device *adev, uint32_t reg,
> {
> uint32_t ret;
>
> + if (adev->in_pci_err_recovery)
> + return 0;
I don't know the whole scheme of this, but this looks racy.
It looks like the normal path through this function is the readl(),
which I assume is an MMIO read from the PCI device. If this is called
after a PCI error occurs, but before amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, the readl() will return 0xffffffff.
If this is called after amdgpu_pci_slot_reset() sets
adev->in_pci_err_recovery, it will return 0. Do you really want those
two different cases?
> if (!(acc_flags & AMDGPU_REGS_NO_KIQ) && amdgpu_sriov_runtime(adev))
> return amdgpu_kiq_rreg(adev, reg);
> @@ -4773,7 +4809,9 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>
> pci_restore_state(pdev);
>
> + adev->in_pci_err_recovery = true;
> r = amdgpu_device_ip_suspend(adev);
> + adev->in_pci_err_recovery = false;
> if (r)
> goto out;
More information about the amd-gfx
mailing list