[PATCH] drm/amdgpu: reset asic after system-wide suspend aborted

Mon Nov 22 14:10:25 UTC 2021

[Public]

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Thursday, November 18, 2021 10:28 PM
> To: Alex Deucher <alexdeucher at gmail.com>
> Cc: Christian König <ckoenig.leichtzumerken at gmail.com>; Liang, Prike
> <Prike.Liang at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>; Huang, Ray <Ray.Huang at amd.com>;
> amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide suspend
> aborted
>
>
>
> On 11/18/2021 7:55 PM, Alex Deucher wrote:
> > On Thu, Nov 18, 2021 at 9:15 AM Lazar, Lijo <lijo.lazar at amd.com> wrote:
> >>
> >>
> >>
> >> On 11/18/2021 7:41 PM, Christian König wrote:
> >>> Am 18.11.21 um 15:09 schrieb Lazar, Lijo:
> >>>> On 11/18/2021 7:36 PM, Alex Deucher wrote:
> >>>>> On Thu, Nov 18, 2021 at 8:11 AM Liang, Prike <Prike.Liang at amd.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> [Public]
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> >>>>>>> Sent: Thursday, November 18, 2021 4:01 PM
> >>>>>>> To: Liang, Prike <Prike.Liang at amd.com>;
> >>>>>>> amd-gfx at lists.freedesktop.org
> >>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang,
> Ray
> >>>>>>> <Ray.Huang at amd.com>
> >>>>>>> Subject: Re: [PATCH] drm/amdgpu: reset asic after system-wide
> >>>>>>> suspend aborted
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 11/18/2021 12:32 PM, Prike Liang wrote:
> >>>>>>>> Do ASIC reset at the moment Sx suspend aborted behind of
> amdgpu
> >>>>>>>> suspend to keep AMDGPU in a clean reset state and that can
> >>>>>>>> avoid re-initialize device improperly error.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Prike Liang <Prike.Liang at amd.com>
> >>>>>>>> ---
> >>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
> >>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 ++++
> >>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 19
> >>>>>>> +++++++++++++++++++
> >>>>>>>>     3 files changed, 24 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>>>>>> index b85b67a..8bd9833 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>>>>>>> @@ -1053,6 +1053,7 @@ struct amdgpu_device {
> >>>>>>>>       bool                            in_s3;
> >>>>>>>>       bool                            in_s4;
> >>>>>>>>       bool                            in_s0ix;
> >>>>>>>> +   bool                            pm_completed;
> >>>>>>>
> >>>>>>> PM framework maintains separate flags, why not use the same?
> >>>>>>>
> >>>>>>>            dev->power.is_suspended = false;
> >>>>>>>            dev->power.is_noirq_suspended = false;
> >>>>>>>            dev->power.is_late_suspended = false;
> >>>>>>>
> >>>>>>
> >>>>>> Thanks for pointing it out and I miss that flag. In this case we
> >>>>>> can use the PM flag is_noirq_suspended for checking AMDGPU
> device
> >>>>>> whether is finished suspend.
> >>>>>>
> >>>>>>> BTW, Alex posted a patch which does similar thing, though it
> >>>>>>> tries reset if suspend fails.
> >>>>>>>
> >>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%
> >>>>>>>
> 2Flore.kernel.org%2Fall%2FDM6PR12MB26195F8E099407B4B6966FEBE4999
> >>>>>>> %40&data=04%7C01%7Clijo.lazar%40amd.com%7C2ce211aeee
> b448f6cb
> >>>>>>>
> 2c08d9aa9f4741%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63
> 77
> >>>>>>>
> 28423343483218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiL
> CJQI
> >>>>>>>
> joiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=nyzhG
> >>>>>>>
> wTJV83YZkit34Bb%2B5tBxGEMvFzCyZPjz8eSDgc%3D&reserved=0
> >>>>>>>
> >>>>>>> DM6PR12MB2619.namprd12.prod.outlook.com/
> >>>>>>>
> >>>>>>> If that reset also failed, then no point in another reset, or
> >>>>>>> keep it as part of resume.
> >>>>>>>
> >>>>>>
> >>>>>> Alex's patch seems always do the ASIC reset at the end of AMDGPU
> >>>>>> device, but that may needn't on the normal AMDGPU device suspend.
> >>>>>> However, this patch shows that can handle the system suspend
> aborted
> >>>>>> after AMDGPU suspend case well, so now seems we only need take
> care
> >>>>>> suspend abort case here.
> >>>>>>
> >>>>>
> >>>>> Yeah, I was thinking we'd take this patch rather than mine to minimize
> >>>>> the number of resets.
> >>>>>
> >>>>
> >>>> Wondering if suspend fails whether there is a need to call resume. It
> >>>> may not get to resume() to do a reset, that needs to be checked.
> >>>
> >>> I would rather do it so that we reset the ASIC during resume when we
> >>> detect that the hardware is in a bad state.
> >>>
> >>> This way it should also work with things like kexec and virtualization.
> >>
> >> I was thinking from the power framework perspective. If the device's
> >> suspend() callback returns failure, why would the framework needs to
> >> call a resume() on that device.
> >
> > The device's suspend callback succeeds, but some other device or event
> > in the system causes the overall suspend to abort.  As such the GPU is
> > never powered off by the platform so it's left in a partially
> > initialized state.  The system then calls the resume() callbacks for
> > all of the devices that were previously suspended to bring them back
> > to a working system.  GPU reset is just a convenient way to get the
> > device back into a known good state.  Unfortunately, I'm not sure if
> > there is a good way to detect whether the GPU is in a known good state
> > or not until we try and re-init the IPs and at that point the IPs are
> > not fully initialized so it's complex to try and unwind that, reset,
> > and try again.  It's probably easiest to just reset the GPU on resume
> > all the time.  If the GPU was powered down, the reset should work fine
> > since we are resetting a GPU that just came out of reset.  If the GPU
> > was not powered down, the reset will put the GPU into a known good
> > state.
> >
>
> https://elixir.bootlin.com/linux/latest/source/drivers/base/power/main.c#L
> 925
>
> I don't have a machine to trace the path. The flag is set only if
> suspend is succesful.

The PM flags are set on the PM opt callback processed successfully and then reset after relevant PM resume callback complete undoing process. Regards to sort out the system suspend abort after AMDGPU suspend seems not easy only use PM flags(is_suspended, .is_late_suspended ,is_noirq_suspended) since in the AMDGPU device resume phase only leave the flags is_suspended =1 and is_late_suspended/ is_noirq_suspended =0 no matter normal system suspend or system suspend process abort after AMDGPU suspend. In this patch the flag pm_completed only can handle the case which abort system suspend before call AMDGPU noirq suspend, and can't well cover all the system abort corner case such as system abort after AMDGPU noirq phase. So, how about check the PM suspend state by using the sys file  /sys/power/suspend_stats/failed_suspend  instead of tracing the flags of PM framework or driver callback defined and only do the GPU reset in the amdgpu_device_resume() when detect failed suspend?

If we give up checking suspend state and always do the GPU reset in the amdgpu_device_resume(), then that may have a side effect on the system resume latency?

> Thanks,
> Lijo
>
> > Alex
> >
> >>
> >> Thanks,
> >> Lijo
> >>
> >>>
> >>> Regards,
> >>> Christian.
> >>>
> >>>>
> >>>> Thanks,
> >>>> Lijo
> >>>>
> >>>>
> >>>>> Alex
> >>>>>
> >>>>>
> >>>>>>> Thanks,
> >>>>>>> Lijo
> >>>>>>>
> >>>>>>>>
> >>>>>>>>       atomic_t                        in_gpu_reset;
> >>>>>>>>       enum pp_mp1_state               mp1_state;
> >>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>> index ec42a6f..a12ed54 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>>>>>>> @@ -3983,6 +3983,10 @@ int amdgpu_device_resume(struct
> drm_device
> >>>>>>> *dev, bool fbcon)
> >>>>>>>>       if (adev->in_s0ix)
> >>>>>>>>               amdgpu_gfx_state_change_set(adev,
> >>>>>>> sGpuChangeState_D0Entry);
> >>>>>>>>
> >>>>>>>> +   if (!adev->pm_completed) {
> >>>>>>>> +           dev_warn(adev->dev, "suspend aborted will do asic
> >>>>>>>> reset\n");
> >>>>>>>> +           amdgpu_asic_reset(adev);
> >>>>>>>> +   }
> >>>>>>>>       /* post card */
> >>>>>>>>       if (amdgpu_device_need_post(adev)) {
> >>>>>>>>               r = amdgpu_device_asic_init(adev); diff --git
> >>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> index eee3cf8..9f90017 100644
> >>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>>>>>>> @@ -2168,6 +2168,23 @@ static int
> amdgpu_pmops_suspend(struct
> >>>>>>> device *dev)
> >>>>>>>>       return r;
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>> +/*
> >>>>>>>> + * Actually the PM suspend whether is completed should be
> confirmed
> >>>>>>>> + * by checking the sysfs
> >>>>>>>> +sys/power/suspend_stats/failed_suspend.However,
> >>>>>>>> + * in this function only check the AMDGPU device whether is
> >>>>>>>> suspended
> >>>>>>>> + * completely in the system-wide suspend process.
> >>>>>>>> + */
> >>>>>>>> +static int amdgpu_pmops_noirq_suspend(struct device *dev) {
> >>>>>>>> +   struct drm_device *drm_dev = dev_get_drvdata(dev);
> >>>>>>>> +   struct amdgpu_device *adev = drm_to_adev(drm_dev);
> >>>>>>>> +
> >>>>>>>> +   dev_dbg(dev, "amdgpu suspend completely.\n");
> >>>>>>>> +   adev->pm_completed = true;
> >>>>>>>> +
> >>>>>>>> +   return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>>     static int amdgpu_pmops_resume(struct device *dev)
> >>>>>>>>     {
> >>>>>>>>       struct drm_device *drm_dev = dev_get_drvdata(dev); @@ -
> 2181,6
> >>>>>>>> +2198,7 @@ static int amdgpu_pmops_resume(struct device *dev)
> >>>>>>>>       r = amdgpu_device_resume(drm_dev, true);
> >>>>>>>>       if (amdgpu_acpi_is_s0ix_active(adev))
> >>>>>>>>               adev->in_s0ix = false;
> >>>>>>>> +   adev->pm_completed = false;
> >>>>>>>>       return r;
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>> @@ -2397,6 +2415,7 @@ static const struct dev_pm_ops
> amdgpu_pm_ops
> >>>>>>> = {
> >>>>>>>>       .runtime_suspend = amdgpu_pmops_runtime_suspend,
> >>>>>>>>       .runtime_resume = amdgpu_pmops_runtime_resume,
> >>>>>>>>       .runtime_idle = amdgpu_pmops_runtime_idle,
> >>>>>>>> +   .suspend_noirq = amdgpu_pmops_noirq_suspend,
> >>>>>>>>     };
> >>>>>>>>
> >>>>>>>>     static int amdgpu_flush(struct file *f, fl_owner_t id)
> >>>>>>>>
> >>>