[PATCH] drm/amdgpu: add support to SMU debug option
Quan, Evan
Evan.Quan at amd.com
Thu Dec 2 02:47:46 UTC 2021
[AMD Official Use Only]
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Yu,
> Lang
> Sent: Wednesday, December 1, 2021 7:37 PM
> To: Koenig, Christian <Christian.Koenig at amd.com>; Christian König
> <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>
> [AMD Official Use Only]
>
>
>
> >-----Original Message-----
> >From: Koenig, Christian <Christian.Koenig at amd.com>
> >Sent: Wednesday, December 1, 2021 7:29 PM
> >To: Yu, Lang <Lang.Yu at amd.com>; Christian König
> ><ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
> >Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
> ><Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >
> >Am 01.12.21 um 12:20 schrieb Yu, Lang:
> >> [AMD Official Use Only]
> >>
> >>> -----Original Message-----
> >>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> >>> Sent: Wednesday, December 1, 2021 6:49 PM
> >>> To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
> >>> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> >>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
> >>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >>>
> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
> >>>> [AMD Official Use Only]
> >>>>
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
> >>>>> Sent: Wednesday, December 1, 2021 5:30 PM
> >>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
> >>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
> >>>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
> >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
> >>>>>
> >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
> >>>>>> To maintain system error state when SMU errors occurred, which
> >>>>>> will aid in debugging SMU firmware issues, add SMU debug option
> support.
> >>>>>>
> >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
> >>>>>> When enabled, it makes SMU errors fatal.
> >>>>>> It is disabled by default.
> >>>>>>
> >>>>>> == Command Guide ==
> >>>>>>
> >>>>>> 1, enable SMU debug option
> >>>>>>
> >>>>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>>>>>
> >>>>>> 2, disable SMU debug option
> >>>>>>
> >>>>>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
> >>>>>>
> >>>>>> v3:
> >>>>>> - Use debugfs_create_bool().(Christian)
> >>>>>> - Put variable into smu_context struct.
> >>>>>> - Don't resend command when timeout.
> >>>>>>
> >>>>>> v2:
> >>>>>> - Resend command when timeout.(Lijo)
> >>>>>> - Use debugfs file instead of module parameter.
> >>>>>>
> >>>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
> >>>>> Well the debugfs part looks really nice and clean now, but one
> >>>>> more comment below.
> >>>>>
> >>>>>> ---
> >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++
> >>>>>> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 5 +++++
> >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
> >>>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++++++-
> >>>>>> 4 files changed, 17 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> index 164d6a9e9fbb..86cd888c7822 100644
> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
> >>>>>> amdgpu_device
> >>>>> *adev)
> >>>>>> if (!debugfs_initialized())
> >>>>>> return 0;
> >>>>>>
> >>>>>> + debugfs_create_bool("amdgpu_smu_debug", 0600, root,
> >>>>>> + &adev->smu.smu_debug_mode);
> >>>>>> +
> >>>>>> ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
> root,
> >adev,
> >>>>>> &fops_ib_preempt);
> >>>>>> if (IS_ERR(ent)) {
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> index f738f7dc20c9..50dbf5594a9d 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> >>>>>> @@ -569,6 +569,11 @@ struct smu_context
> >>>>>> struct smu_user_dpm_profile user_dpm_profile;
> >>>>>>
> >>>>>> struct stb_context stb_context;
> >>>>>> + /*
> >>>>>> + * When enabled, it makes SMU errors fatal.
> >>>>>> + * (0 = disabled (default), 1 = enabled)
> >>>>>> + */
> >>>>>> + bool smu_debug_mode;
> >>>>>> };
> >>>>>>
> >>>>>> struct i2c_adapter;
> >>>>>> diff --git
> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> index 6e781cee8bb6..d3797a2d6451 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
> >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
> >>>>> smu_context *smu)
> >>>>>> out:
> >>>>>> mutex_unlock(&smu->message_lock);
> >>>>>>
> >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && ret);
> >>>>>> +
> >>>>>> return ret;
> >>>>>> }
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> index 048ca1673863..9be005eb4241 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> >>>>>> @@ -349,15 +349,21 @@ int
> >smu_cmn_send_smc_msg_with_param(struct
> >>>>> smu_context *smu,
> >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param,
> >msg);
> >>>>>> goto Out;
> >>>>>> }
> >>>>>> +
> >>>>>> __smu_cmn_send_msg(smu, (uint16_t) index, param);
> >>>>>> reg = __smu_cmn_poll_stat(smu);
> >>>>>> res = __smu_cmn_reg2errno(smu, reg);
> >>>>>> - if (res != 0)
> >>>>>> + if (res != 0) {
> >>>>>> __smu_cmn_reg_print_error(smu, reg, index, param,
> >msg);
> >>>>>> + goto Out;
> >>>>>> + }
> >>>>>> if (read_arg)
> >>>>>> smu_cmn_read_arg(smu, read_arg);
> >>>>>> Out:
> >>>>>> mutex_unlock(&smu->message_lock);
> >>>>>> +
> >>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
> >>>>> BUG_ON() really crashes the kernel and is only allowed if we
> >>>>> prevent further data corruption with that.
> >>>>>
> >>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
> >>>>> judge here since I don't know the SMU code well enough.
> >>>> This is what SMU FW guys want. They want "user-visible (potentially
> >>>> fatal)
> >>> errors", then a hang.
> >>>> They want to keep system state since the error occurred.
> >>> Well that is rather problematic.
> >>>
> >>> First of all we need to really justify that, crashing the kernel is
> >>> not something easily done.
> >>>
> >>> Then this isn't really effective here. What happens is that you
> >>> crash the kernel thread of the currently executing process, but it
> >>> is perfectly possible that another thread still tries to send
> >>> messages to the SMU. You need to have the BUG_ON() before dropping
> >>> the lock to make sure that this really gets the driver stuck in the current
> state.
> >> Thanks. I got it. I just thought it is a kenel panic.
> >> Could we use a panic() here?
> >
> >Potentially, but that might reboot the system automatically which is
> >probably not what you want either.
> >
> >How does the SMU firmware team gather the necessary information when
> a
> >problem occurs?
>
> As far as I know, they usually use a HDT to collect information.
> And they request a hang when error occurred in ticket.
> "Suggested error responses include pop-up windows (by x86 driver, if this is
> possible) or simply hanging after logging the error. "
[Quan, Evan] Maybe what they want is just a stable SMU state(like no more message issuing after failure).
If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in smu_cmn_send_smc_msg_with_param() and smu_cmn_send_msg_without_waiting()).
That will prevent further message issuing to SMU.
BR
Evan
>
> Regards,
> Lang
>
> >
> >When they connect external hardware a BUG_ON() while holding the SMU
> >lock might work, but if they want to SSH into the system you should
> >probably rather set a flag that the hardware shouldn't be touched any
> >more by the driver and continue.
> >
> >Otherwise the box is most likely really unstable after this.
> >
> >Regards,
> >Christian.
> >
> >>
> >> Regards,
> >> Lang
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>> Regards,
> >>>> Lang
> >>>>
> >>>>> Christian.
> >>>>>
> >>>>>> +
> >>>>>> return res;
> >>>>>> }
> >>>>>>
More information about the amd-gfx
mailing list