[PATCH] drm/amd: Add Suspend/Hibernate notification callback support

Mario Limonciello mario.limonciello at amd.com
Fri Nov 22 16:00:51 UTC 2024


On 11/22/2024 09:59, Alex Deucher wrote:
> On Fri, Nov 22, 2024 at 10:57 AM Mario Limonciello
> <mario.limonciello at amd.com> wrote:
>>
>> On 11/22/2024 08:28, Lazar, Lijo wrote:
>>>
>>>
>>> On 11/19/2024 1:33 AM, Mario Limonciello wrote:
>>>> As part of the suspend sequence VRAM needs to be evicted on dGPUs.
>>>> In order to make suspend/resume more reliable we moved this into
>>>> the pmops prepare() callback so that the suspend sequence would fail
>>>> but the system could remain operational under high memory usage suspend.
>>>>
>>>> Another class of issues exist though where due to memory fragementation
>>>> there isn't a large enough contiguous space and swap isn't accessible.
>>>>
>>>> Add support for a suspend/hibernate notification callback that could
>>>> evict VRAM before tasks are frozen. This should allow paging out to swap
>>>> if necessary.
>>>>
>>>> Link: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
>>>> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3476
>>>> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 36 ++++++++++++++++++++++
>>>>    2 files changed, 37 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> index a37e687acbbc5..e70ca85151046 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>> @@ -885,6 +885,7 @@ struct amdgpu_device {
>>>>       bool                            need_swiotlb;
>>>>       bool                            accel_working;
>>>>       struct notifier_block           acpi_nb;
>>>> +    struct notifier_block           pm_nb;
>>>>       struct amdgpu_i2c_chan          *i2c_bus[AMDGPU_MAX_I2C_BUS];
>>>>       struct debugfs_blob_wrapper     debugfs_vbios_blob;
>>>>       struct debugfs_blob_wrapper     debugfs_discovery_blob;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index b3ca911e55d61..5a4e9c7daf895 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -190,6 +190,8 @@ void amdgpu_set_init_level(struct amdgpu_device *adev,
>>>>    }
>>>>
>>>>    static inline void amdgpu_device_stop_pending_resets(struct amdgpu_device *adev);
>>>> +static int amdgpu_device_pm_notifier(struct notifier_block *nb, unsigned long mode,
>>>> +                                 void *data);
>>>>
>>>>    /**
>>>>     * DOC: pcie_replay_count
>>>> @@ -4582,6 +4584,11 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>>>
>>>>       amdgpu_device_check_iommu_direct_map(adev);
>>>>
>>>> +    adev->pm_nb.notifier_call = amdgpu_device_pm_notifier;
>>>> +    r = register_pm_notifier(&adev->pm_nb);
>>>> +    if (r)
>>>> +            goto failed;
>>>> +
>>>>       return 0;
>>>>
>>>>    release_ras_con:
>>>> @@ -4646,6 +4653,8 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>>>               drain_workqueue(adev->mman.bdev.wq);
>>>>       adev->shutdown = true;
>>>>
>>>> +    unregister_pm_notifier(&adev->pm_nb);
>>>> +
>>>>       /* make sure IB test finished before entering exclusive mode
>>>>        * to avoid preemption on IB test
>>>>        */
>>>> @@ -4777,6 +4786,33 @@ static int amdgpu_device_evict_resources(struct amdgpu_device *adev)
>>>>    /*
>>>>     * Suspend & resume.
>>>>     */
>>>> +/**
>>>> + * amdgpu_device_pm_notifier - Notification block for Suspend/Hibernate events
>>>> + * @nb: notifier block
>>>> + * @mode: suspend mode
>>>> + * @data: data
>>>> + *
>>>> + * This function is called when the system is about to suspend or hibernate.
>>>> + * It is used to evict resources from the device before the system goes to
>>>> + * sleep while there is still access to swap.
>>>> + *
>>>> + */
>>>> +static int amdgpu_device_pm_notifier(struct notifier_block *nb, unsigned long mode,
>>>> +                                 void *data)
>>>> +{
>>>> +    struct amdgpu_device *adev = container_of(nb, struct amdgpu_device, pm_nb);
>>>> +
>>>> +    switch (mode) {
>>>> +    case PM_HIBERNATION_PREPARE:
>>>> +    case PM_SUSPEND_PREPARE:
>>>> +            if (amdgpu_device_evict_resources(adev))
>>>
>>> This will result in an eviction call on APUs since the flags won't be
>>> set by this time. Is that intended?
>>
>> Very good catch!  I will bump it and modify
>> amdgpu_device_evict_resources() to just skip APUs entirely.
> 
> We can skip for suspend on APUs, but we need to keep it for hibernation.

Right; I realized that when I looked at the code, and that's what I 
ended up doing for v2.

> 
> Alex
> 
>>
>>>
>>> https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L4739
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> +                    return NOTIFY_BAD;
>>>> +            break;
>>>> +    }
>>>> +
>>>> +    return NOTIFY_DONE;
>>>> +}
>>>> +
>>>>    /**
>>>>     * amdgpu_device_prepare - prepare for device suspend
>>>>     *
>>>
>>



More information about the amd-gfx mailing list