[PATCH] drm/amd/pm: fix the high voltage and temperature issue on smu 13

Alex Deucher alexdeucher at gmail.com
Mon Oct 23 14:56:31 UTC 2023


On Sun, Oct 22, 2023 at 9:05 PM Feng, Kenneth <Kenneth.Feng at amd.com> wrote:
>
> Thanks Alex, I will make another patch.
> And please refer to the comments inline below.
>
>
> -----Original Message-----
> From: Alex Deucher <alexdeucher at gmail.com>
> Sent: Friday, October 20, 2023 9:58 PM
> To: Feng, Kenneth <Kenneth.Feng at amd.com>
> Cc: amd-gfx at lists.freedesktop.org; Wang, Yang(Kevin) <KevinYang.Wang at amd.com>
> Subject: Re: [PATCH] drm/amd/pm: fix the high voltage and temperature issue on smu 13
>
>
> On Fri, Oct 20, 2023 at 4:32 AM Kenneth Feng <kenneth.feng at amd.com> wrote:
> >
> > fix the high voltage and temperature issue after the driver is
> > unloaded on smu 13.0.0, smu 13.0.7 and smu 13.0.10
> >
> > Signed-off-by: Kenneth Feng <kenneth.feng at amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 36 +++++++++++++++----
> >  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c        |  4 +--
> >  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c     | 27 ++++++++++++--
> >  drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  1 +
> >  drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h  |  2 ++
> >  .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c    | 13 +++++++
> >  .../drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c  |  8 ++++-
> >  .../drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c  |  8 ++++-
> >  8 files changed, 86 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 31f8c3ead161..c5c892a8b3f9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3986,13 +3986,23 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >                                 }
> >                         }
> >                 } else {
> > -                       tmp = amdgpu_reset_method;
> > -                       /* It should do a default reset when loading or reloading the driver,
> > -                        * regardless of the module parameter reset_method.
> > -                        */
> > -                       amdgpu_reset_method = AMD_RESET_METHOD_NONE;
> > -                       r = amdgpu_asic_reset(adev);
> > -                       amdgpu_reset_method = tmp;
> > +                       switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +                       case IP_VERSION(13, 0, 0):
> > +                       case IP_VERSION(13, 0, 7):
> > +                       case IP_VERSION(13, 0, 10):
> > +                               r = psp_gpu_reset(adev);
> > +                               break;
> > +                       default:
> > +                               tmp = amdgpu_reset_method;
> > +                               /* It should do a default reset when loading or reloading the driver,
> > +                                * regardless of the module parameter reset_method.
> > +                                */
> > +                               amdgpu_reset_method = AMD_RESET_METHOD_NONE;
> > +                               r = amdgpu_asic_reset(adev);
> > +                               amdgpu_reset_method = tmp;
> > +                               break;
> > +                       }
> > +
> >                         if (r) {
> >                                 dev_err(adev->dev, "asic reset on init failed\n");
> >                                 goto failed;
> > @@ -5945,6 +5955,18 @@ int amdgpu_device_baco_exit(struct drm_device *dev)
> >                 return -ENOTSUPP;
> >
> >         ret = amdgpu_dpm_baco_exit(adev);
> > +
> > +       if (!ret)
> > +               switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +               case IP_VERSION(13, 0, 0):
> > +               case IP_VERSION(13, 0, 7):
> > +               case IP_VERSION(13, 0, 10):
> > +                       adev->gfx.is_poweron = false;
> > +                       break;
> > +               default:
> > +                       break;
> > +               }
>
> It may be better to move this into smu_v13_0_0_baco_exit() so that the ASIC-specific details stay out of the common files.
>
> > +
> >         if (ret)
> >                 return ret;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > index 80ca2c05b0b8..3ad38e42773b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > @@ -73,7 +73,7 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device *adev,
> >                  * fini/suspend, so the overall state doesn't
> >                  * change over the course of suspend/resume.
> >                  */
> > -               if (!adev->in_s0ix)
> > +               if (!adev->in_s0ix && adev->gfx.is_poweron)
> >                         amdgpu_gmc_set_vm_fault_masks(adev, AMDGPU_GFXHUB(0), false);
> >                 break;
> >         case AMDGPU_IRQ_STATE_ENABLE:
> > @@ -85,7 +85,7 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device *adev,
> >                  * fini/suspend, so the overall state doesn't
> >                  * change over the course of suspend/resume.
> >                  */
> > -               if (!adev->in_s0ix)
> > +               if (!adev->in_s0ix && adev->gfx.is_poweron)
> >                         amdgpu_gmc_set_vm_fault_masks(adev, AMDGPU_GFXHUB(0), true);
> >                 break;
> >         default:
>
>
> These changes are probably a valid bug fix on their own.
> [Kenneth] - When the driver is unloaded, the gfx core is powered off first, so the gfxhub interrupt operation in gmc_hw_fini needs to be skipped. Do we need a separate patch for this?

Would this trigger in any other cases?  E.g., suspend/resume?  If so,
I think it makes sense as a standalone bug fix.  If not, it's fine to
include it in this patch.
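
To make the question concrete, here is a minimal standalone sketch of the
guard that the gmc_v11_0 hunk adds. It is not driver code: struct fake_adev
and may_touch_gfxhub_fault_masks() are hypothetical names used only for
illustration, and the two flags stand in for adev->in_s0ix and
adev->gfx.is_poweron.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two amdgpu_device fields the hunk tests. */
struct fake_adev {
	bool in_s0ix;        /* models adev->in_s0ix */
	bool gfx_is_poweron; /* models adev->gfx.is_poweron */
};

/*
 * Models the new condition: the GFXHUB fault masks are only programmed
 * when the device is not in s0ix and the gfx core is still powered on.
 */
static bool may_touch_gfxhub_fault_masks(const struct fake_adev *adev)
{
	return !adev->in_s0ix && adev->gfx_is_poweron;
}
```

If any suspend/resume path can reach this point with the second flag
cleared, the guard changes behavior there too, which is what would justify
a standalone fix.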

>
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > index 7c3356d6da5e..30e5f7161737 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > @@ -733,7 +733,7 @@ static int smu_early_init(void *handle)
> >         smu->adev = adev;
> >         smu->pm_enabled = !!amdgpu_dpm;
> >         smu->is_apu = false;
> > -       smu->smu_baco.state = SMU_BACO_STATE_EXIT;
> > +       smu->smu_baco.state = SMU_BACO_STATE_NONE;
>
>
> I'm not sure I understand this change.  Is this just to set the default BACO state?  Maybe this would be better as a separate patch.
> [Kenneth] - smu->smu_baco.state is needed when the driver is unloaded: if the state is BACO exited, MP1_FIRMWARE_FLAG needs to be reset to 0; otherwise it does not need to be reset.
> Currently smu->smu_baco.state defaults to the BACO exited state, so we can't tell whether the hardware has really exited BACO. Do you think we still need a separate patch for it?

No, this makes sense.  I just want to verify the reason for the
change.  It would be nice to include this detail in the commit
message.
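
For reference, the reasoning above can be modeled in a few lines. This is
only a sketch under the assumptions stated in Kenneth's reply:
needs_fw_flags_reset() is a hypothetical helper, not a function in the
patch, and the enum simply mirrors the one in amdgpu_smu.h after this
change.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors enum smu_baco_state with the new NONE value from the patch. */
enum smu_baco_state {
	SMU_BACO_STATE_ENTER = 0,
	SMU_BACO_STATE_EXIT,
	SMU_BACO_STATE_NONE, /* new default: BACO was never entered or exited */
};

/*
 * Hypothetical helper: at unload, the firmware flags only need to be
 * cleared when the hardware has really gone through a BACO exit.
 */
static bool needs_fw_flags_reset(enum smu_baco_state state)
{
	return state == SMU_BACO_STATE_EXIT;
}
```

With the old default of SMU_BACO_STATE_EXIT, a device that never entered
BACO would be indistinguishable from one that had exited it, so the check
would always fire.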

>
> >         smu->smu_baco.platform_support = false;
> >         smu->user_dpm_profile.fan_mode = -1;
> >
> > @@ -1740,10 +1740,25 @@ static int smu_smc_hw_cleanup(struct smu_context *smu)
> >         return 0;
> >  }
> >
> > +static int smu_reset_mp1_state(struct smu_context *smu)
> > +{
> > +       struct amdgpu_device *adev = smu->adev;
> > +
> > +       switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +       case IP_VERSION(13, 0, 0):
> > +       case IP_VERSION(13, 0, 7):
> > +       case IP_VERSION(13, 0, 10):
> > +               return smu_set_mp1_state(smu, PP_MP1_STATE_UNLOAD);
> > +       default:
> > +               return 0;
> > +       }
> > +}
> > +
> >  static int smu_hw_fini(void *handle)
> >  {
> >         struct amdgpu_device *adev = (struct amdgpu_device *)handle;
> >         struct smu_context *smu = adev->powerplay.pp_handle;
> > +       int ret;
> >
> >         if (amdgpu_sriov_vf(adev) && !amdgpu_sriov_is_pp_one_vf(adev))
> >                 return 0;
> > @@ -1761,7 +1776,15 @@ static int smu_hw_fini(void *handle)
> >
> >         adev->pm.dpm_enabled = false;
> >
> > -       return smu_smc_hw_cleanup(smu);
> > +       ret = smu_smc_hw_cleanup(smu);
> > +       if (ret)
> > +               return ret;
> > +
> > +       ret = smu_reset_mp1_state(smu);
> > +       if (ret)
> > +               return ret;
> > +
> > +       return 0;
> >  }
> >
> >  static void smu_late_fini(void *handle)
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > index 1454eed76604..9f2dbc90b606 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h
> > @@ -419,6 +419,7 @@ enum smu_reset_mode {
> >  enum smu_baco_state {
> >         SMU_BACO_STATE_ENTER = 0,
> >         SMU_BACO_STATE_EXIT,
> > +       SMU_BACO_STATE_NONE,
> >  };
> >
> >  struct smu_baco_context {
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > index cc02f979e9e9..43c7ba68eb50 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h
> > @@ -299,5 +299,7 @@ int smu_v13_0_update_pcie_parameters(struct smu_context *smu,
> >                                      uint8_t pcie_gen_cap,
> >                                      uint8_t pcie_width_cap);
> >
> > +int smu_v13_0_disable_pmfw_state(struct smu_context* smu);
> > +
> >  #endif
> >  #endif
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > index bcb7ab9d2221..0724441e53ef 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
> > @@ -2473,3 +2473,16 @@ int smu_v13_0_update_pcie_parameters(struct smu_context *smu,
> >
> >         return 0;
> >  }
> > +
> > +int smu_v13_0_disable_pmfw_state(struct smu_context* smu)
> > +{
> > +       int ret;
> > +       struct amdgpu_device *adev = smu->adev;
> > +
> > +       WREG32_PCIE(MP1_Public | (smnMP1_FIRMWARE_FLAGS & 0xffffffff), 0);
> > +
> > +       ret = RREG32_PCIE(MP1_Public |
> > +                                          (smnMP1_FIRMWARE_FLAGS & 0xffffffff));
> > +
> > +       return ret == 0 ? 0 : -EINVAL;
> > +}
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > index 47d008cbc186..0a167f70f4bc 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c
> > @@ -2758,7 +2758,13 @@ static int smu_v13_0_0_set_mp1_state(struct smu_context *smu,
> >
> >         switch (mp1_state) {
> >         case PP_MP1_STATE_UNLOAD:
> > -               ret = smu_cmn_set_mp1_state(smu, mp1_state);
> > +               ret = smu_cmn_send_smc_msg_with_param(smu,
> > +                                                               SMU_MSG_PrepareMp1ForUnload,
> > +                                                               0x55, NULL);
> > +
> > +               if(!ret && smu->smu_baco.state == SMU_BACO_STATE_EXIT)
>
> space between if and (
>
> > +                       ret = smu_v13_0_disable_pmfw_state(smu);
> > +
> >                 break;
> >         default:
> >                 /* Ignore others */
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > index b8a7a1d853df..d7a4a03b1e31 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c
> > @@ -2429,7 +2429,13 @@ static int smu_v13_0_7_set_mp1_state(struct smu_context *smu,
> >
> >         switch (mp1_state) {
> >         case PP_MP1_STATE_UNLOAD:
> > -               ret = smu_cmn_set_mp1_state(smu, mp1_state);
> > +               ret = smu_cmn_send_smc_msg_with_param(smu,
> > +                                                               SMU_MSG_PrepareMp1ForUnload,
> > +                                                               0x55, NULL);
> > +
> > +               if(!ret && smu->smu_baco.state == SMU_BACO_STATE_EXIT)
>
> Same here.
>
> Alex
>
> > +                       ret = smu_v13_0_disable_pmfw_state(smu);
> > +
> >                 break;
> >         default:
> >                 /* Ignore others */
> > --
> > 2.34.1
> >

