[PATCH] drm/amd: Fail the suspend if resources can't be evicted

Christian König ckoenig.leichtzumerken at gmail.com
Thu Oct 27 14:56:49 UTC 2022


On 26.10.22 at 21:03, Mario Limonciello wrote:
> If a system does not have swap and memory is at 100% usage,
> amdgpu will fail to evict resources.  Currently the suspend
> carries on anyway and proceeds to reset the GPU:
>
> ```
> [drm] evicting device resources failed
> [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
> [drm] free PSP TMR buffer
> [TTM] Failed allocating page table
> [drm] evicting device resources failed
> amdgpu 0000:03:00.0: amdgpu: MODE1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
> amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
> ```
>
> At this point, had the suspend actually succeeded, I think that amdgpu
> would have recovered because the GPU would have had its power cut and
> restored.  However, the kernel fails to continue the suspend because of
> the memory pressure, and amdgpu then fails to run the "resume" from the
> aborted suspend.
>
> ```
> ACPI: PM: Preparing to enter system sleep state S3
> SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
>    cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
>    node 0: slabs: 22, objs: 1122, free: 0
> ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
>
> [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
> [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
> amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
> PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
> amdgpu 0000:03:00.0: PM: failed to resume async: error -62
> ```
>
> To avoid this series of unfortunate events, fail amdgpu's suspend
> when the memory eviction fails.  This will let the system gracefully
> recover and the user can try suspend again when the memory pressure
> is relieved.
>
> Reported-by: post at davidak.de
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>

Acked-by: Christian König <christian.koenig at amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 6f958603c8cc2..ae10acede495e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>    * at suspend time.
>    *
>    */
> -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
> +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
>   {
> +	int ret;
> +
>   	/* No need to evict vram on APUs for suspend to ram or s2idle */
>   	if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
> -		return;
> +		return 0;
>   
> -	if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
> +	ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
> +	if (ret)
>   		DRM_WARN("evicting device resources failed\n");
> -
> +	return ret;
>   }
>   
>   /*
> @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   	if (!adev->in_s0ix)
>   		amdgpu_amdkfd_suspend(adev, adev->in_runpm);
>   
> -	amdgpu_device_evict_resources(adev);
> +	r = amdgpu_device_evict_resources(adev);
> +	if (r)
> +		return r;
>   
>   	amdgpu_fence_driver_hw_fini(adev);
>   
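
For readers who want to see the control flow outside the kernel tree, the
change in the diff above can be sketched as a small userspace model. This is
only an illustration under stated assumptions: `struct fake_dev` and the
`fake_*` functions are hypothetical stand-ins, not amdgpu or TTM APIs, and
the `evict_fails` flag merely simulates memory pressure. It shows the
eviction helper returning an error code and the suspend path bailing out
before any later teardown step runs.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct amdgpu_device; not the real layout. */
struct fake_dev {
	bool is_apu;        /* models adev->flags & AMD_IS_APU */
	bool in_s3;         /* models adev->in_s3 */
	bool evict_fails;   /* assumption: simulates memory pressure */
	bool hw_torn_down;  /* set once the later suspend steps have run */
};

/* Hypothetical stand-in for the VRAM eviction call. */
static int fake_evict_vram(struct fake_dev *dev)
{
	return dev->evict_fails ? -ENOMEM : 0;
}

/* Mirrors the patched helper: returns an int instead of void. */
static int fake_evict_resources(struct fake_dev *dev)
{
	int ret;

	/* APUs skip eviction for suspend to RAM, so report success. */
	if (dev->in_s3 && dev->is_apu)
		return 0;

	ret = fake_evict_vram(dev);
	if (ret)
		fprintf(stderr, "evicting device resources failed\n");
	return ret;
}

/* Mirrors the patched caller: abort the suspend if eviction failed. */
static int fake_suspend(struct fake_dev *dev)
{
	int r;

	r = fake_evict_resources(dev);
	if (r)
		return r;

	/* Only reached when eviction succeeded. */
	dev->hw_torn_down = true;
	return 0;
}
```

In this model a device with `evict_fails` set gets `-ENOMEM` back from
`fake_suspend()` and `hw_torn_down` stays false, which corresponds to the
suspend being failed before the hardware is touched further.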
