[PATCH] drm/amdgpu: reset asic after system-wide suspend aborted

Thu Nov 18 14:28:22 UTC 2021

On 11/18/2021 7:55 PM, Alex Deucher wrote:
> On Thu, Nov 18, 2021 at 9:15 AM Lazar, Lijo <lijo.lazar at amd.com> wrote:
>>
>>
>>
>> On 11/18/2021 7:41 PM, Christian König wrote:
>>> Am 18.11.21 um 15:09 schrieb Lazar, Lijo:
>>>> On 11/18/2021 7:36 PM, Alex Deucher wrote:
>>>>> On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang at amd.com>
>>>>> wrote:
>>>>>>
>>>>>> [Public]
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>>>> Sent: Thursday, November 18, 2021 4:01 PM
>>>>>>> To: Liang, Prike <Prike.Liang at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>>>>>> <Ray.Huang at amd.com>
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend
>>>>>>> aborted
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/18/2021 12:32 PM, Prike Liang wrote:
>>>>>>>> Do ASIC reset at the moment Sx suspend aborted behind of amdgpu
>>>>>>>> suspend to keep AMDGPU in a clean reset state and that can avoid
>>>>>>>> re-initialize device improperly error.
>>>>>>>>
>>>>>>>> Signed-off-by: Prike Liang <Prike.Liang at amd.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 19
>>>>>>> +++++++++++++++++++
>>>>>>>>     3 files changed, 24 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> index b85b67a..8bd9833 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>>>>> @@ -1053,6 +1053,7 @@ struct amdgpu_device {
>>>>>>>>       bool                            in_s3;
>>>>>>>>       bool                            in_s4;
>>>>>>>>       bool                            in_s0ix;
>>>>>>>> +   bool                            pm_completed;
>>>>>>>
>>>>>>> PM framework maintains separate flags, why not use the same?
>>>>>>>
>>>>>>>            dev->power.is_suspended = false;
>>>>>>>            dev->power.is_noirq_suspended = false;
>>>>>>>            dev->power.is_late_suspended = false;
>>>>>>>
>>>>>>
>>>>>> Thanks for pointing it out and I miss that flag. In this case we can
>>>>>> use the PM flag is_noirq_suspended for checking AMDGPU device
>>>>>> whether is finished suspend.
>>>>>>
>>>>>>> BTW, Alex posted a patch which does similar thing, though it tries
>>>>>>> reset if
>>>>>>> suspend fails.
>>>>>>>
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999%40&data=04%7C01%7Clijo.lazar%40amd.com%7C2ce211aeeeb448f6cb2c08d9aa9f4741%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637728423343483218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=nyzhGwTJV83YZkit34Bb%2B5tBxGEMvFzCyZPjz8eSDgc%3D&reserved=0
>>>>>>>
>>>>>>> DM6PR12MB2619.namprd12.prod.outlook.com/
>>>>>>>
>>>>>>> If that reset also failed, then no point in another reset, or keep
>>>>>>> it as part of
>>>>>>> resume.
>>>>>>>
>>>>>>
>>>>>> Alex's patch seems always do the ASIC reset at the end of AMDGPU
>>>>>> device, but that may needn't on the normal AMDGPU device suspend.
>>>>>> However, this patch shows that can handle the system suspend aborted
>>>>>> after AMDGPU suspend case well, so now seems we only need take care
>>>>>> suspend abort case here.
>>>>>>
>>>>>
>>>>> Yeah, I was thinking we'd take this patch rather than mine to minimize
>>>>> the number of resets.
>>>>>
>>>>
>>>> Wondering if suspend fails whether there is a need to call resume. It
>>>> may not get to resume() to do a reset, that needs to be checked.
>>>
>>> I would rather do it so that we reset the ASIC during resume when we
>>> detect that the hardware is in a bad state.
>>>
>>> This way it should also work with things like kexec and virtualization.
>>
>> I was thinking from the power framework perspective. If the device's
>> suspend() callback returns failure, why would the framework needs to
>> call a resume() on that device.
> 
> The device's suspend callback succeeds, but some other device or event
> in the system causes the overall suspend to abort.  As such the GPU is
> never powered off by the platform so it's left in a partially
> initialized state.  The system then calls the resume() callbacks for
> all of the devices that were previously suspended to bring them back
> to a working system.  GPU reset is just a convenient way to get the
> device back into a known good state.  Unfortunately, I'm not sure if
> there is a good way to detect whether the GPU is in a known good state
> or not until we try and re-init the IPs and at that point the IPs are
> not fully initialized so it's complex to try and unwind that, reset,
> and try again.  It's probably easiest to just reset the GPU on resume
> all the time.  If the GPU was powered down, the reset should work fine
> since we are resetting a GPU that just came out of reset.  If the GPU
> was not powered down, the reset will put the GPU into a known good
> state.
> 

https://elixir.bootlin.com/linux/latest/source/drivers/base/power/main.c#L925

I don't have a machine to trace the path. The flag is set only if 
suspend is succesful.

Thanks,
Lijo

> Alex
> 
>>
>> Thanks,
>> Lijo
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>>
>>>>>>>>       atomic_t                        in_gpu_reset;
>>>>>>>>       enum pp_mp1_state               mp1_state;
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index ec42a6f..a12ed54 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct drm_device
>>>>>>> *dev, bool fbcon)
>>>>>>>>       if (adev->in_s0ix)
>>>>>>>>               amdgpu_gfx_state_change_set(adev,
>>>>>>> sGpuChangeState_D0Entry);
>>>>>>>>
>>>>>>>> +   if (!adev->pm_completed) {
>>>>>>>> +           dev_warn(adev->dev, "suspend aborted will do asic
>>>>>>>> reset\n");
>>>>>>>> +           amdgpu_asic_reset(adev);
>>>>>>>> +   }
>>>>>>>>       /* post card */
>>>>>>>>       if (amdgpu_device_need_post(adev)) {
>>>>>>>>               r = amdgpu_device_asic_init(adev); diff --git
>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> index eee3cf8..9f90017 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>>>>>> @@ -2168,6 +2168,23 @@ static int amdgpu_pmops_suspend(struct
>>>>>>> device *dev)
>>>>>>>>       return r;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> +/*
>>>>>>>> + * Actually the PM suspend whether is completed should be confirmed
>>>>>>>> + * by checking the sysfs
>>>>>>>> +sys/power/suspend_stats/failed_suspend.However,
>>>>>>>> + * in this function only check the AMDGPU device whether is
>>>>>>>> suspended
>>>>>>>> + * completely in the system-wide suspend process.
>>>>>>>> + */
>>>>>>>> +static int amdgpu_pmops_noirq_suspend(struct device *dev) {
>>>>>>>> +   struct drm_device *drm_dev = dev_get_drvdata(dev);
>>>>>>>> +   struct amdgpu_device *adev = drm_to_adev(drm_dev);
>>>>>>>> +
>>>>>>>> +   dev_dbg(dev, "amdgpu suspend completely.\n");
>>>>>>>> +   adev->pm_completed = true;
>>>>>>>> +
>>>>>>>> +   return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>     static int amdgpu_pmops_resume(struct device *dev)
>>>>>>>>     {
>>>>>>>>       struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -2181,6
>>>>>>>> +2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
>>>>>>>>       r = amdgpu_device_resume(drm_dev, true);
>>>>>>>>       if (amdgpu_acpi_is_s0ix_active(adev))
>>>>>>>>               adev->in_s0ix = false;
>>>>>>>> +   adev->pm_completed = false;
>>>>>>>>       return r;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> @@ -2397,6 +2415,7 @@ static const struct dev_pm_ops amdgpu_pm_ops
>>>>>>> = {
>>>>>>>>       .runtime_suspend = amdgpu_pmops_runtime_suspend,
>>>>>>>>       .runtime_resume = amdgpu_pmops_runtime_resume,
>>>>>>>>       .runtime_idle = amdgpu_pmops_runtime_idle,
>>>>>>>> +   .suspend_noirq = amdgpu_pmops_noirq_suspend,
>>>>>>>>     };
>>>>>>>>
>>>>>>>>     static int amdgpu_flush(struct file *f, fl_owner_t id)
>>>>>>>>
>>>