[PATCH] drm/amd: Fail the suspend if resources can't be evicted

Thu Oct 27 13:39:47 UTC 2022

Patch is:
Reviewed-by: Alex Deucher <alexander.deucher at amd.com>

On Thu, Oct 27, 2022 at 9:39 AM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Thu, Oct 27, 2022 at 2:31 AM Christian König
> <christian.koenig at amd.com> wrote:
> >
> > Am 26.10.22 um 21:03 schrieb Mario Limonciello:
> > > If a system does not have swap and memory is under 100% usage,
> > > amdgpu will fail to evict resources.  Currently the suspend
> > > carries on proceeding to reset the GPU:
> > >
> > > ```
> > > [drm] evicting device resources failed
> > > [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
> > > [drm] free PSP TMR buffer
> > > [TTM] Failed allocating page table
> > > [drm] evicting device resources failed
> > > amdgpu 0000:03:00.0: amdgpu: MODE1 reset
> > > amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
> > > amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
> > > ```
> > >
> > > At this point if the suspend actually succeeded I think that amdgpu
> > > would have recovered because the GPU would have power cut off and
> > > restored.  However the kernel fails to continue the suspend from the
> > > memory pressure and amdgpu fails to run the "resume" from the aborted
> > > suspend.
> > >
> > > ```
> > > ACPI: PM: Preparing to enter system sleep state S3
> > > SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
> > >    cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
> > >    node 0: slabs: 22, objs: 1122, free: 0
> > > ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
> > >
> > > [drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
> > > [drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
> > > [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
> > > amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
> > > PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
> > > amdgpu 0000:03:00.0: PM: failed to resume async: error -62
> > > ```
> > >
> > > To avoid this series of unfortunate events, fail amdgpu's suspend
> > > when the memory eviction fails.  This will let the system gracefully
> > > recover and the user can try suspend again when the memory pressure
> > > is relieved.
> >
> > Yeah, I've been thinking about that handling for a while now as well.
> >
> > Failing to suspend when we are OOM is certainly the right thing to do
> > from a technical perspective.
> >
> > But it also means that when users close their laptop it can happen that
> > it keeps running and draining the battery.
> >
> > On the other hand when you don't have swap configured it's your fault
> > and not the drivers.
> >
> > It's a trade off and I'm not sure what's better. Alex any comment here?
>
> Probably better handled in userspace as Mario noted.  Also, at least
> the whole driver won't fall apart due to important buffers getting
> lost.
>
> Alex
>
>
> >
> > Thanks,
> > Christian.
> >
> > >
> > > Reported-by: post at davidak.de
> > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
> > > Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 ++++++++++-----
> > >   1 file changed, 10 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index 6f958603c8cc2..ae10acede495e 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -4060,15 +4060,18 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
> > >    * at suspend time.
> > >    *
> > >    */
> > > -static void amdgpu_device_evict_resources(struct amdgpu_device *adev)
> > > +static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
> > >   {
> > > +     int ret;
> > > +
> > >       /* No need to evict vram on APUs for suspend to ram or s2idle */
> > >       if ((adev->in_s3 || adev->in_s0ix) && (adev->flags & AMD_IS_APU))
> > > -             return;
> > > +             return 0;
> > >
> > > -     if (amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM))
> > > +     ret = amdgpu_ttm_evict_resources(adev, TTM_PL_VRAM);
> > > +     if (ret)
> > >               DRM_WARN("evicting device resources failed\n");
> > > -
> > > +     return ret;
> > >   }
> > >
> > >   /*
> > > @@ -4118,7 +4121,9 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
> > >       if (!adev->in_s0ix)
> > >               amdgpu_amdkfd_suspend(adev, adev->in_runpm);
> > >
> > > -     amdgpu_device_evict_resources(adev);
> > > +     r = amdgpu_device_evict_resources(adev);
> > > +     if (r)
> > > +             return r;
> > >
> > >       amdgpu_fence_driver_hw_fini(adev);
> > >
> >