[PATCH] drm/amdgpu: add support to SMU debug option

Yu, Lang Lang.Yu at amd.com
Thu Dec 2 03:12:18 UTC 2021


[AMD Official Use Only]



>-----Original Message-----
>From: Quan, Evan <Evan.Quan at amd.com>
>Sent: Thursday, December 2, 2021 10:48 AM
>To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
><Christian.Koenig at amd.com>; Christian König
><ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
><Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>
>[AMD Official Use Only]
>
>
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Yu,
>> Lang
>> Sent: Wednesday, December 1, 2021 7:37 PM
>> To: Koenig, Christian <Christian.Koenig at amd.com>; Christian König
>> <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> [AMD Official Use Only]
>>
>>
>>
>> >-----Original Message-----
>> >From: Koenig, Christian <Christian.Koenig at amd.com>
>> >Sent: Wednesday, December 1, 2021 7:29 PM
>> >To: Yu, Lang <Lang.Yu at amd.com>; Christian König
>> ><ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>> >Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>> ><Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>> >Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >
>> >Am 01.12.21 um 12:20 schrieb Yu, Lang:
>> >> [AMD Official Use Only]
>> >>
>> >>> -----Original Message-----
>> >>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> >>> Sent: Wednesday, December 1, 2021 6:49 PM
>> >>> To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
>> >>> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>> >>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>> >>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>> >>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >>>
>> >>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>> >>>> [AMD Official Use Only]
>> >>>>
>> >>>>
>> >>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>> >>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>> >>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>> >>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>> >>>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>> >>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>> >>>>>
>> >>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>> >>>>>> To maintain system error state when SMU errors occurred, which
>> >>>>>> will aid in debugging SMU firmware issues, add SMU debug option
>> support.
>> >>>>>>
>> >>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>> >>>>>> When enabled, it makes SMU errors fatal.
>> >>>>>> It is disabled by default.
>> >>>>>>
>> >>>>>> == Command Guide ==
>> >>>>>>
>> >>>>>> 1, enable SMU debug option
>> >>>>>>
>> >>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>> >>>>>>
>> >>>>>> 2, disable SMU debug option
>> >>>>>>
>> >>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>> >>>>>>
>> >>>>>> v3:
>> >>>>>>     - Use debugfs_create_bool().(Christian)
>> >>>>>>     - Put variable into smu_context struct.
>> >>>>>>     - Don't resend command when timeout.
>> >>>>>>
>> >>>>>> v2:
>> >>>>>>     - Resend command when timeout.(Lijo)
>> >>>>>>     - Use debugfs file instead of module parameter.
>> >>>>>>
>> >>>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>> >>>>> Well the debugfs part looks really nice and clean now, but one
>> >>>>> more comment below.
>> >>>>>
>> >>>>>> ---
>> >>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>> >>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>> >>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>> >>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
>> >>>>>>
>> >>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> >>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
>> >>>>>> amdgpu_device
>> >>>>> *adev)
>> >>>>>>     	if (!debugfs_initialized())
>> >>>>>>     		return 0;
>> >>>>>>
>> >>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>> >>>>>> +				  &adev->smu.smu_debug_mode);
>> >>>>>> +
>> >>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
>> root,
>> >adev,
>> >>>>>>     				  &fops_ib_preempt);
>> >>>>>>     	if (IS_ERR(ent)) {
>> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>> >>>>>> @@ -569,6 +569,11 @@ struct smu_context
>> >>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
>> >>>>>>
>> >>>>>>     	struct stb_context stb_context;
>> >>>>>> +	/*
>> >>>>>> +	 * When enabled, it makes SMU errors fatal.
>> >>>>>> +	 * (0 = disabled (default), 1 = enabled)
>> >>>>>> +	 */
>> >>>>>> +	bool smu_debug_mode;
>> >>>>>>     };
>> >>>>>>
>> >>>>>>     struct i2c_adapter;
>> >>>>>> diff --git
>> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>> >>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>> >>>>> smu_context *smu)
>> >>>>>>     out:
>> >>>>>>     	mutex_unlock(&smu->message_lock);
>> >>>>>>
>> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>> >>>>>> +
>> >>>>>>     	return ret;
>> >>>>>>     }
>> >>>>>>
>> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> index 048ca1673863..9be005eb4241 100644
>> >>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> >>>>>> @@ -349,15 +349,21 @@ int
>> >smu_cmn_send_smc_msg_with_param(struct
>> >>>>> smu_context *smu,
>> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>> >msg);
>> >>>>>>     		goto Out;
>> >>>>>>     	}
>> >>>>>> +
>> >>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>> >>>>>>     	reg = __smu_cmn_poll_stat(smu);
>> >>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
>> >>>>>> -	if (res != 0)
>> >>>>>> +	if (res != 0) {
>> >>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param,
>> >msg);
>> >>>>>> +		goto Out;
>> >>>>>> +	}
>> >>>>>>     	if (read_arg)
>> >>>>>>     		smu_cmn_read_arg(smu, read_arg);
>> >>>>>>     Out:
>> >>>>>>     	mutex_unlock(&smu->message_lock);
>> >>>>>> +
>> >>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>> >>>>> BUG_ON() really crashes the kernel and is only allowed if we
>> >>>>> prevent further data corruption with that.
>> >>>>>
>> >>>>> Most of the time WARN_ON() is more appropriate, but I can't
>> >>>>> fully judge here since I don't know the SMU code well enough.
>> >>>> This is what SMU FW guys want. They want "user-visible
>> >>>> (potentially
>> >>>> fatal)
>> >>> errors", then a hang.
>> >>>> They want to keep system state since the error occurred.
>> >>> Well that is rather problematic.
>> >>>
>> >>> First of all we need to really justify that, crashing the kernel
>> >>> is not something easily done.
>> >>>
>> >>> Then this isn't really effective here. What happens is that you
>> >>> crash the kernel thread of the currently executing process, but it
>> >>> is perfectly possible that another thread still tries to send
>> >>> messages to the SMU. You need to have the BUG_ON() before dropping
>> >>> the lock to make sure that this really gets the driver stuck in
>> >>> the current
>> state.
>> >> Thanks. I got it. I just thought it is a kenel panic.
>> >> Could we use a panic() here?
>> >
>> >Potentially, but that might reboot the system automatically which is
>> >probably not what you want either.
>> >
>> >How does the SMU firmware team gather the necessary information when
>> a
>> >problem occurs?
>>
>> As far as I know, they usually use a HDT to collect information.
>> And they request a hang when error occurred in ticket.
>> "Suggested error responses include pop-up windows (by x86 driver, if
>> this is
>> possible) or simply hanging after logging the error. "
>[Quan, Evan] Maybe what they want is just a stable SMU state(like no more
>message issuing after failure).
>If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in
>smu_cmn_send_smc_msg_with_param() and
>smu_cmn_send_msg_without_waiting()).
>That will prevent further message issuing to SMU.

But it's difficult to distinguish normal timeout cases(see aldebaran_mode2_reset()).
Sometimes it's reasonable to return a timeout. The user wants to use a longer timeout.

Regards,
Lang

>
>BR
>Evan
>>
>> Regards,
>> Lang
>>
>> >
>> >When they connect external hardware a BUG_ON() while holding the SMU
>> >lock might work, but if they want to SSH into the system you should
>> >probably rather set a flag that the hardware shouldn't be touched any
>> >more by the driver and continue.
>> >
>> >Otherwise the box is most likely really unstable after this.
>> >
>> >Regards,
>> >Christian.
>> >
>> >>
>> >> Regards,
>> >> Lang
>> >>
>> >>> Regards,
>> >>> Christian.
>> >>>
>> >>>> Regards,
>> >>>> Lang
>> >>>>
>> >>>>> Christian.
>> >>>>>
>> >>>>>> +
>> >>>>>>     	return res;
>> >>>>>>     }
>> >>>>>>


More information about the amd-gfx mailing list