[PATCH] drm/amd: Fail the suspend if resources can't be evicted
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Oct 27 14:56:49 UTC 2022
Am 26.10.22 um 21:03 schrieb Mario Limonciello:
> If a system does not have swap and memory is under 100% usage,
> amdgpu will fail to evict resources. Currently the suspend
> carries on proceeding to reset the GPU:
>
> ```
> [drm] evicting device resources failed
> [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
> [drm] free PSP TMR buffer
> [TTM] Failed allocating page table
> [drm] evicting device resources failed
> amdgpu 0000:03:00.0: amdgpu: MODE1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
> ```
>
> At this point if the suspend actually succeeded I think that amdgpu
> would have recovered because the GPU would have power cut off and
> restored. However the kernel fails to continue the suspend from the
> memory pressure and amdgpu fails to run the "resume" from the aborted
> suspend.
>
> ```
> ACPI: PM: Preparing to enter system sleep state S3
> SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
> cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
> node 0: slabs: 22, objs: 1122, free: 0
> ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
>
> [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
> [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
> amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
> PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
> amdgpu 0000:03:00.0: PM: failed to resume async: error -62
> ```
>
> To avoid this series of unfortunate events, fail amdgpu's suspend
> when the memory eviction fails. This will let the system gracefully
> recover and the user can try suspend again when the memory pressure
> is relieved.
>
> Reported-by: post at davidak.de
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
Acked-by: Christian König <christian.koenig at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 6f958603c8cc2..ae10acede495e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> * at suspend time.
> *
> */
> -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
> +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
> {
> + int ret;
> +
> /* No need to evict vram on APUs for suspend to ram or s2idle */
> if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
> - return;
> + return 0;
>
> - if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
> + ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
> + if (ret)
> DRM_WARN("evicting device resources failed\n");
> -
> + return ret;
> }
>
> /*
> @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> if (!adev->in_s0ix)
> amdgpu_amdkfd_suspend(adev, adev->in_runpm);
>
> - amdgpu_device_evict_resources(adev);
> + r = amdgpu_device_evict_resources(adev);
> + if (r)
> + return r;
>
> amdgpu_fence_driver_hw_fini(adev);
>
More information about the amd-gfx
mailing list