[PATCH v6 0/5] reduce system memory requirement for hibernation

Mario Limonciello mario.limonciello at amd.com
Thu Jul 10 15:57:37 UTC 2025


On 7/10/2025 2:23 AM, Samuel Zhang wrote:
> 
> Modern data center dGPUs are usually equipped with very large VRAM. On a
> server with such dGPUs (192GB VRAM * 8) and 2TB of system memory,
> hibernation fails because there is not enough free memory.
> 
> The root cause is that during hibernation all VRAM gets evicted to GTT
> or shmem. In both cases, it ends up in system memory and the kernel will
> try to copy the pages to the hibernation image. In the worst case, this
> causes 2 copies of VRAM in system memory, and 2TB is not enough for the
> hibernation image: 192GB * 8 * 2 = 3TB > 2TB.
> 
> The fix includes the following changes. With these changes, far fewer
> pages need to be copied to the hibernation image and hibernation can
> succeed.
> * patch 1 and 2: move GTT to shmem after evicting VRAM, so that the GTT
>    pages can be freed.
> * patch 3: force-write shmem pages to the swap disk and free them.
> 
> 
> After GTT is swapped out to shmem in the hibernation prepare stage, the
> GPU is resumed again in the thaw stage. The swap-in and BO restore done
> by that resume take a lot of time (50 minutes observed for 8 dGPUs), and
> they are unnecessary, since writing the hibernation image does not need
> the GPU when hibernation succeeds.
> * patch 4 and 5: skip resume of the device in the thaw stage for the
>    successful hibernation case to reduce the hibernation time.
> 
> 
> v2:
> * split first patch to 2 patches, 1 for ttm, 1 for amdgpu
> * refined the new ttm api
> * add more comments for shrink_shmem_memory() and its callsite
> * export variable pm_transition in kernel
> * skip resume in thaw() for successful hibernation case
> v3:
> * refined ttm_device_prepare_hibernation() to accept device argument
> * use guard(mutex) to replace mutex_lock and mutex_unlock
> * move ttm_device_prepare_hibernation call to amdgpu_device_evict_resources()
> * add pm_transition_event(), instead of exporting pm_transition variable
> * refined amdgpu_pmops_thaw(), use switch-case for more clarity
> v4:
> * remove guard(mutex) and fix kdoc for ttm_device_prepare_hibernation
> * refined kdoc for pm_transition_event() and PM_EVENT_ messages
> * use dev_err in amdgpu_pmops_thaw()
> * add Reviewed-by and Acked-by for patches 2, 3 and 5
> v5:
> * add Reviewed-by for patch 1
> * use pm_hibernate_is_recovering() to replace pm_transition_event()
> * check in_suspend in amdgpu_pmops_prepare() and amdgpu_pmops_poweroff()
> v6:
> * move pm_hibernate_is_recovering() from pm.h to suspend.h
> * rebase to next-20250709 tag of linux-next
> * add Tested-by for patch 5
> 
> 
> The merge options are either:
> * the linux-pm changes go to linux-pm and an immutable branch for drm to
>    merge
> * everything goes through amd-staging-drm-next (and an amdgpu PR to drm
>    later)
> * everything goes through drm-misc-next
> 
> Mario Limonciello thinks sending everything through drm-misc-next makes
> the most sense if everyone is amenable.
> 

Applied, thanks.

530694f54dd5e (HEAD -> drm-misc-next, drm-misc/for-linux-next, drm-misc/drm-misc-next) drm/amdgpu: do not resume device in thaw for normal hibernation
c2aaddbd2dede PM: hibernate: add new api pm_hibernate_is_recovering()
2640e819474f4 PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
924dda024f3be drm/amdgpu: move GTT to shmem after eviction for hibernation
40b6a946d21ee drm/ttm: add new api ttm_device_prepare_hibernation()

> 
> Samuel Zhang (5):
> 1. drm/ttm: add new api ttm_device_prepare_hibernation()
> 2. drm/amdgpu: move GTT to shmem after eviction for hibernation
> 3. PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()
> 4. PM: hibernate: add new api pm_hibernate_is_recovering()
> 5. drm/amdgpu: do not resume device in thaw for normal hibernation
> 
>   drivers/base/power/main.c                  | 14 ++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 ++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 17 ++++++++++++++
>   drivers/gpu/drm/ttm/ttm_device.c           | 23 +++++++++++++++++++
>   include/drm/ttm/ttm_device.h               |  1 +
>   include/linux/suspend.h                    |  2 ++
>   kernel/power/hibernate.c                   | 26 ++++++++++++++++++++++
>   7 files changed, 92 insertions(+), 1 deletion(-)
> 


