[PATCH 2/2] drm/amdgpu: reset gpu for pm abort case

Liang, Prike Prike.Liang at amd.com
Tue Jan 30 08:50:26 UTC 2024


[AMD Official Use Only - General]

> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Monday, January 29, 2024 2:48 PM
> To: Liang, Prike <Prike.Liang at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Sharma, Deepak <Deepak.Sharma at amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: reset gpu for pm abort case
>
>
>
> On 1/26/2024 2:30 PM, Liang, Prike wrote:
> >
> >>
> >> On 1/25/2024 8:52 AM, Prike Liang wrote:
> >>> In the PM abort case, the GFX power rail is not turned off from the FCH side,
> >>> which causes GFX reinitialization to fail because the GFX HW state is
> >>> unknown. So let's reset the GPU to a known good power state.
> >>>
> >>
> >> From the description, this is an APU-only problem (or this patch can
> >> only resolve the APU abort sequence). However, there is no check for APU
> >> in the patch below.
> >>
> > [Prike]  IIRC, there is also a similar problem on the dGPU side when
> > suspend is aborted, but for now this patch is only drafted for a hot issue
> > on the RV series. If needed, we can add a TODO item to draft a more
> > generic solution.
> >
>
> If this addresses a specific issue, then it is better to check the specific IP
> revision before presenting this as a generic fix. Presently the patch logic
> treats this as generic for all SOC15 ASICs.
>
Until someone can confirm whether there is a similar problem on the dGPU side, I prefer to limit this quirk to specific ASICs only.
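A minimal, self-contained sketch of what "limit this quirk to specific ASICs" could look like, assuming the gate is an APU check plus an ASIC/IP-revision match placed ahead of the patch's existing condition. The struct below is a simplified stand-in, not struct amdgpu_device; is_apu and is_rv_series are hypothetical placeholders for whatever flag test and IP-version comparison the real patch would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in for the fields the quirk reads. This is NOT
 * struct amdgpu_device; is_apu and is_rv_series are hypothetical
 * placeholders for an AMD_IS_APU flag test and an ASIC/IP-version match.
 */
struct pm_abort_state {
	bool is_apu;          /* hypothetical: adev->flags & AMD_IS_APU */
	bool is_rv_series;    /* hypothetical: specific ASIC / IP revision match */
	bool in_suspend;
	bool in_s0ix;
	bool pm_complete;
	uint32_t sol_reg;     /* value read from mmMP0_SMN_C2PMSG_81 */
};

/* Returns true only for an aborted (non-s0ix) suspend on the targeted APUs. */
static bool pm_abort_needs_reset(const struct pm_abort_state *s)
{
	/* Limit the quirk to the ASICs the issue was actually seen on. */
	if (!s->is_apu || !s->is_rv_series)
		return false;

	/* Same condition as the patch: suspend never completed, TOS alive. */
	return s->in_suspend && !s->in_s0ix && !s->pm_complete &&
	       s->sol_reg != 0;
}
```

Putting the ASIC gate first means that, in a real soc15_need_reset_on_init(), other SOC15 ASICs would never evaluate the new condition and would keep their current behaviour.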

> >>
> >>> Signed-off-by: Prike Liang <Prike.Liang at amd.com>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +++++
> >>>  drivers/gpu/drm/amd/amdgpu/soc15.c         | 8 +++++++-
> >>>  2 files changed, 12 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> index 56d9dfa61290..4c40ffaaa5c2 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> >>> @@ -4627,6 +4627,11 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
> >>>                     return r;
> >>>     }
> >>>
> >>> +   if (amdgpu_asic_need_reset_on_init(adev)) {
> >>> +           DRM_INFO("PM abort case and let's reset asic\n");
> >>> +           amdgpu_asic_reset(adev);
> >>> +   }
> >>> +
> >>
> >> suspend_noirq is specific to suspend scenarios and is not valid for freeze/thaw.
> >> I guess this could trigger a reset on a successful restore on APUs.
> >>
> > [Prike] If the device does not go through noirq_suspend, we still need to
> > check whether the PSP TOS is still alive before the GPU reset.
> >
>
> AFAIU, for a successful resume from hibernate on APUs, the TOS will still be
> running. The patch will trigger a reset in such cases as well.
>
> Thanks,
> Lijo
>
Yes, when the system tries to restore the saved image, the TOS should be running at that moment, so I will filter out the hibernate resume case in a later patch.
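A minimal sketch of how the hibernate-restore case might be filtered out, assuming the driver can tell an S4 image restore apart from an S3 resume. That distinction is modelled here by a hypothetical in_hibernate_restore flag; the real follow-up patch may use a different mechanism. As noted in the review, suspend_noirq does not run for freeze/thaw, so pm_complete stays false on a successful hibernate restore while the TOS is alive (sol_reg != 0). That is why the patch's condition alone would misfire in that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical model of the resume-time state; this is not
 * struct amdgpu_device, and in_hibernate_restore is an assumed flag. */
struct resume_state {
	bool in_suspend;
	bool in_s0ix;
	bool pm_complete;           /* set once suspend_noirq has run */
	bool in_hibernate_restore;  /* hypothetical: resuming from an S4 image */
	uint32_t sol_reg;           /* PSP sOS sign of life (mmMP0_SMN_C2PMSG_81) */
};

/* Reset only when a non-s0ix suspend was aborted before suspend_noirq
 * completed and the TOS is still alive. A hibernate restore also leaves
 * pm_complete false with the TOS running, so it is excluded explicitly. */
static bool need_reset_on_resume(const struct resume_state *s)
{
	if (!s->in_suspend || s->in_s0ix)
		return false;

	/* Successful S4 restore: TOS alive is expected, not an abort. */
	if (s->in_hibernate_restore)
		return false;

	return !s->pm_complete && s->sol_reg != 0;
}
```

Checking the restore case before looking at pm_complete and sol_reg keeps the abort heuristic unchanged for S3 while removing the false positive Lijo pointed out.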

Thanks,
Prike
> >>>     if (dev->switch_power_state == DRM_SWITCH_POWER_OFF)
> >>>             return 0;
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>> index 15033efec2ba..9329a00b6abc 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
> >>> @@ -804,9 +804,16 @@ static bool soc15_need_reset_on_init(struct amdgpu_device *adev)
> >>>     if (adev->asic_type == CHIP_RENOIR)
> >>>             return true;
> >>>
> >>> +   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> >>> +
> >>>     /* Just return false for soc15 GPUs.  Reset does not seem to
> >>>      * be necessary.
> >>>      */
> >>
> >> The comment now doesn't make sense.
> >>
> >> Thanks,
> >> Lijo
> >>
> >>> +   if (adev->in_suspend && !adev->in_s0ix &&
> >>> +                   !adev->pm_complete &&
> >>> +                   sol_reg)
> >>> +           return true;
> >>> +
> >>>     if (!amdgpu_passthrough(adev))
> >>>             return false;
> >>>
> >>> @@ -816,7 +823,6 @@ static bool soc15_need_reset_on_init(struct amdgpu_device *adev)
> >>>     /* Check sOS sign of life register to confirm sys driver and sOS
> >>>      * are already been loaded.
> >>>      */
> >>> -   sol_reg = RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81);
> >>>     if (sol_reg)
> >>>             return true;
> >>>


More information about the amd-gfx mailing list