[PATCH] drm/amdgpu: reset asic after system-wide suspend aborted
Lazar, Lijo
lijo.lazar at amd.com
Thu Nov 18 14:28:22 UTC 2021
On 11/18/2021 7:55 PM, Alex Deucher wrote:
> On Thu, Nov 18, 2021 at 9:15 AM Lazar, Lijo <lijo.lazar at amd.com> wrote:
>>
>>
>>
>> On 11/18/2021 7:41 PM, Christian König wrote:
>>> Am 18.11.21 um 15:09 schrieb Lazar, Lijo:
>>>> On 11/18/2021 7:36 PM, Alex Deucher wrote:
>>>>> On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang at amd.com>
>>>>> wrote:
>>>>>>
>>>>>> [Public]
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>>>> Sent: Thursday, November 18, 2021 4:01 PM
>>>>>>> To: Liang, Prike <Prike.Liang at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>>>>>> <Ray.Huang at amd.com>
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend
>>>>>>> aborted
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/18/2021 12:32 PM, Prike Liang wrote:
>>>>>>>> Do ASIC reset at the moment Sx suspend aborted behind of amdgpu
>>>>>>>> suspend to keep AMDGPU in a clean reset state and that can avoid
>>>>>>>> re-initialize device improperly error.
>>>>>>>>
>>>>>>>> Signed-off-by: Prike Liang <Prike.Liang at amd.com>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 19
>>>>>>> +++++++++++++++++++
>>>>>>>> 3 files changed, 24 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> index b85b67a..8bd9833 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> @@ -1053,6 +1053,7 @@ struct amdgpu_device {
>>>>>>>> bool in_s3;
>>>>>>>> bool in_s4;
>>>>>>>> bool in_s0ix;
>>>>>>>> + bool pm_completed;
>>>>>>>
>>>>>>> PM framework maintains separate flags, why not use the same?
>>>>>>>
>>>>>>> dev->power.is_suspended = false;
>>>>>>> dev->power.is_noirq_suspended = false;
>>>>>>> dev->power.is_late_suspended = false;
>>>>>>>
>>>>>>
>>>>>> Thanks for pointing it out and I miss that flag. In this case we can
>>>>>> use the PM flag is_noirq_suspended for checking AMDGPU device
>>>>>> whether is finished suspend.
>>>>>>
>>>>>>> BTW, Alex posted a patch which does similar thing, though it tries
>>>>>>> reset if
>>>>>>> suspend fails.
>>>>>>>
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999%40&data=04%7C01%7Clijo.lazar%40amd.com%7C2ce211aeeeb448f6cb2c08d9aa9f4741%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637728423343483218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=nyzhGwTJV83YZkit34Bb%2B5tBxGEMvFzCyZPjz8eSDgc%3D&reserved=0
>>>>>>>
>>>>>>> DM6PR12MB2619.namprd12.prod.outlook.com/
>>>>>>>
>>>>>>> If that reset also failed, then no point in another reset, or keep
>>>>>>> it as part of
>>>>>>> resume.
>>>>>>>
>>>>>>
>>>>>> Alex's patch seems always do the ASIC reset at the end of AMDGPU
>>>>>> device, but that may needn't on the normal AMDGPU device suspend.
>>>>>> However, this patch shows that can handle the system suspend aborted
>>>>>> after AMDGPU suspend case well, so now seems we only need take care
>>>>>> suspend abort case here.
>>>>>>
>>>>>
>>>>> Yeah, I was thinking we'd take this patch rather than mine to minimize
>>>>> the number of resets.
>>>>>
>>>>
>>>> Wondering if suspend fails whether there is a need to call resume. It
>>>> may not get to resume() to do a reset, that needs to be checked.
>>>
>>> I would rather do it so that we reset the ASIC during resume when we
>>> detect that the hardware is in a bad state.
>>>
>>> This way it should also work with things like kexec and virtualization.
>>
>> I was thinking from the power framework perspective. If the device's
>> suspend() callback returns failure, why would the framework needs to
>> call a resume() on that device.
>
> The device's suspend callback succeeds, but some other device or event
> in the system causes the overall suspend to abort. As such the GPU is
> never powered off by the platform so it's left in a partially
> initialized state. The system then calls the resume() callbacks for
> all of the devices that were previously suspended to bring them back
> to a working system. GPU reset is just a convenient way to get the
> device back into a known good state. Unfortunately, I'm not sure if
> there is a good way to detect whether the GPU is in a known good state
> or not until we try and re-init the IPs and at that point the IPs are
> not fully initialized so it's complex to try and unwind that, reset,
> and try again. It's probably easiest to just reset the GPU on resume
> all the time. If the GPU was powered down, the reset should work fine
> since we are resetting a GPU that just came out of reset. If the GPU
> was not powered down, the reset will put the GPU into a known good
> state.
>
https://elixir.bootlin.com/linux/latest/source/drivers/base/power/main.c#L925
I don't have a machine to trace the path. The flag is set only if
suspend is succesful.
Thanks,
Lijo
> Alex
>
>>
>> Thanks,
>> Lijo
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>>
>>>>>>>> atomic_t in_gpu_reset;
>>>>>>>> enum pp_mp1_state mp1_state;
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index ec42a6f..a12ed54 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device
>>>>>>> *dev, bool fbcon)
>>>>>>>> if (adev->in_s0ix)
>>>>>>>> amdgpu_gfx_state_change_set(adev,
>>>>>>> sGpuChangeState_D0Entry);
>>>>>>>>
>>>>>>>> + if (!adev->pm_completed) {
>>>>>>>> + dev_warn(adev->dev, "suspend aborted will do asic
>>>>>>>> reset\n");
>>>>>>>> + amdgpu_asic_reset(adev);
>>>>>>>> + }
>>>>>>>> /* post card */
>>>>>>>> if (amdgpu_device_need_post(adev)) {
>>>>>>>> r = amdgpu_device_asic_init(adev); diff --git
>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> index eee3cf8..9f90017 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> @@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct
>>>>>>> device *dev)
>>>>>>>> return r;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Actually the PM suspend whether is completed should be confirmed
>>>>>>>> + * by checking the sysfs
>>>>>>>> +sys/power/suspend_stats/failed_suspend.However,
>>>>>>>> + * in this function only check the AMDGPU device whether is
>>>>>>>> suspended
>>>>>>>> + * completely in the system-wide suspend process.
>>>>>>>> + */
>>>>>>>> +static int amdgpu_pmops_noirq_suspend(struct device *dev) {
>>>>>>>> + struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>>>>>> + struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>>>>>> +
>>>>>>>> + dev_dbg(dev, "amdgpu suspend completely.\n");
>>>>>>>> + adev->pm_completed = true;
>>>>>>>> +
>>>>>>>> + return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> static int amdgpu_pmops_resume(struct device *dev)
>>>>>>>> {
>>>>>>>> struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -2181,6
>>>>>>>> +2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
>>>>>>>> r = amdgpu_device_resume(drm_dev, true);
>>>>>>>> if (amdgpu_acpi_is_s0ix_active(adev))
>>>>>>>> adev->in_s0ix = false;
>>>>>>>> + adev->pm_completed = false;
>>>>>>>> return r;
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops
>>>>>>> = {
>>>>>>>> .runtime_suspend = amdgpu_pmops_runtime_suspend,
>>>>>>>> .runtime_resume = amdgpu_pmops_runtime_resume,
>>>>>>>> .runtime_idle = amdgpu_pmops_runtime_idle,
>>>>>>>> + .suspend_noirq = amdgpu_pmops_noirq_suspend,
>>>>>>>> };
>>>>>>>>
>>>>>>>> static int amdgpu_flush(struct file *f, fl_owner_t id)
>>>>>>>>
>>>
More information about the amd-gfx
mailing list