[PATCH] drm/amdgpu: reset asic after system-wide suspend aborted

Christian König ckoenig.leichtzumerken at gmail.com
Thu Nov 18 14:11:54 UTC 2021


Am 18.11.21 um 15:09 schrieb Lazar, Lijo:
> On 11/18/2021 7:36 PM, Alex Deucher wrote:
>> On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang at amd.com> 
>> wrote:
>>>
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>> Sent: Thursday, November 18, 2021 4:01 PM
>>>> To: Liang, Prike <Prike.Liang at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>>> <Ray.Huang at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend
>>>> aborted
>>>>
>>>>
>>>>
>>>> On 11/18/2021 12:32 PM, Prike Liang wrote:
>>>>> Do ASIC reset at the moment Sx suspend aborted behind of amdgpu
>>>>> suspend to keep AMDGPU in a clean reset state and that can avoid
>>>>> re-initialize device improperly error.
>>>>>
>>>>> Signed-off-by: Prike Liang <Prike.Liang at amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 19
>>>> +++++++++++++++++++
>>>>>    3 files changed, 24 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index b85b67a..8bd9833 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1053,6 +1053,7 @@ struct amdgpu_device {
>>>>>      bool                            in_s3;
>>>>>      bool                            in_s4;
>>>>>      bool                            in_s0ix;
>>>>> +   bool                            pm_completed;
>>>>
>>>> PM framework maintains separate flags, why not use the same?
>>>>
>>>>           dev->power.is_suspended = false;
>>>>           dev->power.is_noirq_suspended = false;
>>>>           dev->power.is_late_suspended = false;
>>>>
>>>
>>> Thanks for pointing it out and I miss that flag. In this case we can 
>>> use the PM flag is_noirq_suspended for checking AMDGPU device 
>>> whether is finished suspend.
>>>
>>>> BTW, Alex posted a patch which does similar thing, though it tries 
>>>> reset if
>>>> suspend fails.
>>>>
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999%40&data=04%7C01%7CLijo.Lazar%40amd.com%7C6401dce9411b4c134b0208d9aa9ca644%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637728412055353107%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=Z4dgNvHISHVlR4grHm1RU3%2FJMJVdRe7Ukq1DnjCgG0o%3D&reserved=0 
>>>>
>>>> DM6PR12MB2619.namprd12.prod.outlook.com/
>>>>
>>>> If that reset also failed, then no point in another reset, or keep 
>>>> it as part of
>>>> resume.
>>>>
>>>
>>> Alex's patch seems always do the ASIC reset at the end of AMDGPU 
>>> device, but that may needn't on the normal AMDGPU device suspend. 
>>> However, this patch shows that can handle the system suspend aborted 
>>> after AMDGPU suspend case well, so now seems we only need take care 
>>> suspend abort case here.
>>>
>>
>> Yeah, I was thinking we'd take this patch rather than mine to minimize
>> the number of resets.
>>
>
> Wondering if suspend fails whether there is a need to call resume. It 
> may not get to resume() to do a reset, that needs to be checked.

I would rather do it so that we reset the ASIC during resume when we 
detect that the hardware is in a bad state.

This way it should also work with things like kexec and virtualization.

Regards,
Christian.

>
> Thanks,
> Lijo
>
>
>> Alex
>>
>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>>
>>>>>      atomic_t                        in_gpu_reset;
>>>>>      enum pp_mp1_state               mp1_state;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index ec42a6f..a12ed54 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device
>>>> *dev, bool fbcon)
>>>>>      if (adev->in_s0ix)
>>>>>              amdgpu_gfx_state_change_set(adev,
>>>> sGpuChangeState_D0Entry);
>>>>>
>>>>> +   if (!adev->pm_completed) {
>>>>> +           dev_warn(adev->dev, "suspend aborted will do asic 
>>>>> reset\n");
>>>>> +           amdgpu_asic_reset(adev);
>>>>> +   }
>>>>>      /* post card */
>>>>>      if (amdgpu_device_need_post(adev)) {
>>>>>              r = amdgpu_device_asic_init(adev); diff --git
>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> index eee3cf8..9f90017 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>> @@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct
>>>> device *dev)
>>>>>      return r;
>>>>>    }
>>>>>
>>>>> +/*
>>>>> + * Actually the PM suspend whether is completed should be confirmed
>>>>> + * by checking the sysfs
>>>>> +sys/power/suspend_stats/failed_suspend.However,
>>>>> + * in this function only check the AMDGPU device whether is 
>>>>> suspended
>>>>> + * completely in the system-wide suspend process.
>>>>> + */
>>>>> +static int amdgpu_pmops_noirq_suspend(struct device *dev) {
>>>>> +   struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>>> +   struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>>> +
>>>>> +   dev_dbg(dev, "amdgpu suspend completely.\n");
>>>>> +   adev->pm_completed = true;
>>>>> +
>>>>> +   return 0;
>>>>> +}
>>>>> +
>>>>>    static int amdgpu_pmops_resume(struct device *dev)
>>>>>    {
>>>>>      struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -2181,6
>>>>> +2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
>>>>>      r = amdgpu_device_resume(drm_dev, true);
>>>>>      if (amdgpu_acpi_is_s0ix_active(adev))
>>>>>              adev->in_s0ix = false;
>>>>> +   adev->pm_completed = false;
>>>>>      return r;
>>>>>    }
>>>>>
>>>>> @@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops
>>>> = {
>>>>>      .runtime_suspend = amdgpu_pmops_runtime_suspend,
>>>>>      .runtime_resume = amdgpu_pmops_runtime_resume,
>>>>>      .runtime_idle = amdgpu_pmops_runtime_idle,
>>>>> +   .suspend_noirq = amdgpu_pmops_noirq_suspend,
>>>>>    };
>>>>>
>>>>>    static int amdgpu_flush(struct file *f, fl_owner_t id)
>>>>>



More information about the amd-gfx mailing list