[PATCH] drm/amd: Add Suspend/Hibernate notification callback support
Mario Limonciello
mario.limonciello at amd.com
Fri Nov 22 15:38:00 UTC 2024
On 11/22/2024 08:28, Lazar, Lijo wrote:
>
>
> On 11/19/2024 1:33 AM, Mario Limonciello wrote:
>> As part of the suspend sequence VRAM needs to be evicted on dGPUs.
>> In order to make suspend/resume more reliable we moved this into
>> the pmops prepare() callback so that the suspend sequence would fail
>> but the system could remain operational under high memory usage suspend.
>>
>> Another class of issues exist though where due to memory fragementation
>> there isn't a large enough contiguous space and swap isn't accessible.
>>
>> Add support for a suspend/hibernate notification callback that could
>> evict VRAM before tasks are frozen. This should allow paging out to swap
>> if necessary.
>>
>> Link: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3476
>> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 36 ++++++++++++++++++++++
>> 2 files changed, 37 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index a37e687acbbc5..e70ca85151046 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -885,6 +885,7 @@ struct amdgpu_device {
>> bool need_swiotlb;
>> bool accel_working;
>> struct notifier_block acpi_nb;
>> + struct notifier_block pm_nb;
>> struct amdgpu_i2c_chan *i2c_bus[AMDGPU_MAX_I2C_BUS];
>> struct debugfs_blob_wrapper debugfs_vbios_blob;
>> struct debugfs_blob_wrapper debugfs_discovery_blob;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index b3ca911e55d61..5a4e9c7daf895 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -190,6 +190,8 @@ void amdgpu_set_init_level(struct amdgpu_device *adev,
>> }
>>
>> static inline void amdgpu_device_stop_pending_resets(struct amdgpu_device *adev);
>> +static int amdgpu_device_pm_notifier(struct notifier_block *nb, unsigned long mode,
>> + void *data);
>>
>> /**
>> * DOC: pcie_replay_count
>> @@ -4582,6 +4584,11 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>
>> amdgpu_device_check_iommu_direct_map(adev);
>>
>> + adev->pm_nb.notifier_call = amdgpu_device_pm_notifier;
>> + r = register_pm_notifier(&adev->pm_nb);
>> + if (r)
>> + goto failed;
>> +
>> return 0;
>>
>> release_ras_con:
>> @@ -4646,6 +4653,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>> drain_workqueue(adev->mman.bdev.wq);
>> adev->shutdown = true;
>>
>> + unregister_pm_notifier(&adev->pm_nb);
>> +
>> /* make sure IB test finished before entering exclusive mode
>> * to avoid preemption on IB test
>> */
>> @@ -4777,6 +4786,33 @@ static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
>> /*
>> * Suspend & resume.
>> */
>> +/**
>> + * amdgpu_device_pm_notifier - Notification block for Suspend/Hibernate events
>> + * @nb: notifier block
>> + * @mode: suspend mode
>> + * @data: data
>> + *
>> + * This function is called when the system is about to suspend or hibernate.
>> + * It is used to evict resources from the device before the system goes to
>> + * sleep while there is still access to swap.
>> + *
>> + */
>> +static int amdgpu_device_pm_notifier(struct notifier_block *nb, unsigned long mode,
>> + void *data)
>> +{
>> + struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, pm_nb);
>> +
>> + switch (mode) {
>> + case PM_HIBERNATION_PREPARE:
>> + case PM_SUSPEND_PREPARE:
>> + if (amdgpu_device_evict_resources(adev))
>
> This will result in an eviction call on APUs since the flags won't be
> set by this time. Is that intended?
Very good catch! I will bump it and modify
amdgpu_device_evict_resources() to just skip APUs entirely.
>
> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4739
>
> Thanks,
> Lijo
>
>> + return NOTIFY_BAD;
>> + break;
>> + }
>> +
>> + return NOTIFY_DONE;
>> +}
>> +
>> /**
>> * amdgpu_device_prepare - prepare for device suspend
>> *
>
More information about the amd-gfx
mailing list