[PATCH] drm/amdgpu: reset asic after system-wide suspend aborted

Lazar, Lijo lijo.lazar at amd.com
Thu Nov 18 14:14:42 UTC 2021



On 11/18/2021 7:41 PM, Christian König wrote:
> Am 18.11.21 um 15:09 schrieb Lazar, Lijo:
>> On 11/18/2021 7:36 PM, Alex Deucher wrote:
>>> On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang at amd.com> 
>>> wrote:
>>>>
>>>> [Public]
>>>>
>>>>> -----Original Message-----
>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>> Sent: Thursday, November 18, 2021 4:01 PM
>>>>> To: Liang, Prike <Prike.Liang at amd.com>; amd-gfx at lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>>>> <Ray.Huang at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend
>>>>> aborted
>>>>>
>>>>>
>>>>>
>>>>> On 11/18/2021 12:32 PM, Prike Liang wrote:
>>>>>> Do ASIC reset at the moment Sx suspend aborted behind of amdgpu
>>>>>> suspend to keep AMDGPU in a clean reset state and that can avoid
>>>>>> re-initialize device improperly error.
>>>>>>
>>>>>> Signed-off-by: Prike Liang <Prike.Liang at amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 19
>>>>> +++++++++++++++++++
>>>>>>    3 files changed, 24 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> index b85b67a..8bd9833 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>> @@ -1053,6 +1053,7 @@ struct amdgpu_device {
>>>>>>      bool                            in_s3;
>>>>>>      bool                            in_s4;
>>>>>>      bool                            in_s0ix;
>>>>>> +   bool                            pm_completed;
>>>>>
>>>>> PM framework maintains separate flags, why not use the same?
>>>>>
>>>>>           dev->power.is_suspended = false;
>>>>>           dev->power.is_noirq_suspended = false;
>>>>>           dev->power.is_late_suspended = false;
>>>>>
>>>>
>>>> Thanks for pointing it out and I miss that flag. In this case we can 
>>>> use the PM flag is_noirq_suspended for checking AMDGPU device 
>>>> whether is finished suspend.
>>>>
>>>>> BTW, Alex posted a patch which does similar thing, though it tries 
>>>>> reset if
>>>>> suspend fails.
>>>>>
>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999%40&data=04%7C01%7Clijo.lazar%40amd.com%7C79c861fcfccc4f2cc45d08d9aa9d61f0%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637728415203834017%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=gGnctl8rlNTj6o02hlE06og3RCA%2BQP37B3ejsZSxPdM%3D&reserved=0 
>>>>>
>>>>> DM6PR12MB2619.namprd12.prod.outlook.com/
>>>>>
>>>>> If that reset also failed, then no point in another reset, or keep 
>>>>> it as part of
>>>>> resume.
>>>>>
>>>>
>>>> Alex's patch seems always do the ASIC reset at the end of AMDGPU 
>>>> device, but that may needn't on the normal AMDGPU device suspend. 
>>>> However, this patch shows that can handle the system suspend aborted 
>>>> after AMDGPU suspend case well, so now seems we only need take care 
>>>> suspend abort case here.
>>>>
>>>
>>> Yeah, I was thinking we'd take this patch rather than mine to minimize
>>> the number of resets.
>>>
>>
>> Wondering if suspend fails whether there is a need to call resume. It 
>> may not get to resume() to do a reset, that needs to be checked.
> 
> I would rather do it so that we reset the ASIC during resume when we 
> detect that the hardware is in a bad state.
> 
> This way it should also work with things like kexec and virtualization.

I was thinking from the power framework perspective. If the device's 
suspend() callback returns failure, why would the framework needs to 
call a resume() on that device.

Thanks,
Lijo

> 
> Regards,
> Christian.
> 
>>
>> Thanks,
>> Lijo
>>
>>
>>> Alex
>>>
>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>
>>>>>>      atomic_t                        in_gpu_reset;
>>>>>>      enum pp_mp1_state               mp1_state;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index ec42a6f..a12ed54 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device
>>>>> *dev, bool fbcon)
>>>>>>      if (adev->in_s0ix)
>>>>>>              amdgpu_gfx_state_change_set(adev,
>>>>> sGpuChangeState_D0Entry);
>>>>>>
>>>>>> +   if (!adev->pm_completed) {
>>>>>> +           dev_warn(adev->dev, "suspend aborted will do asic 
>>>>>> reset\n");
>>>>>> +           amdgpu_asic_reset(adev);
>>>>>> +   }
>>>>>>      /* post card */
>>>>>>      if (amdgpu_device_need_post(adev)) {
>>>>>>              r = amdgpu_device_asic_init(adev); diff --git
>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> index eee3cf8..9f90017 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>> @@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct
>>>>> device *dev)
>>>>>>      return r;
>>>>>>    }
>>>>>>
>>>>>> +/*
>>>>>> + * Actually the PM suspend whether is completed should be confirmed
>>>>>> + * by checking the sysfs
>>>>>> +sys/power/suspend_stats/failed_suspend.However,
>>>>>> + * in this function only check the AMDGPU device whether is 
>>>>>> suspended
>>>>>> + * completely in the system-wide suspend process.
>>>>>> + */
>>>>>> +static int amdgpu_pmops_noirq_suspend(struct device *dev) {
>>>>>> +   struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>>>> +   struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>>>> +
>>>>>> +   dev_dbg(dev, "amdgpu suspend completely.\n");
>>>>>> +   adev->pm_completed = true;
>>>>>> +
>>>>>> +   return 0;
>>>>>> +}
>>>>>> +
>>>>>>    static int amdgpu_pmops_resume(struct device *dev)
>>>>>>    {
>>>>>>      struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -2181,6
>>>>>> +2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
>>>>>>      r = amdgpu_device_resume(drm_dev, true);
>>>>>>      if (amdgpu_acpi_is_s0ix_active(adev))
>>>>>>              adev->in_s0ix = false;
>>>>>> +   adev->pm_completed = false;
>>>>>>      return r;
>>>>>>    }
>>>>>>
>>>>>> @@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops
>>>>> = {
>>>>>>      .runtime_suspend = amdgpu_pmops_runtime_suspend,
>>>>>>      .runtime_resume = amdgpu_pmops_runtime_resume,
>>>>>>      .runtime_idle = amdgpu_pmops_runtime_idle,
>>>>>> +   .suspend_noirq = amdgpu_pmops_noirq_suspend,
>>>>>>    };
>>>>>>
>>>>>>    static int amdgpu_flush(struct file *f, fl_owner_t id)
>>>>>>
> 


More information about the amd-gfx mailing list