[PATCH] drm/amdgpu: add support to SMU debug option

Lazar, Lijo lijo.lazar at amd.com
Thu Dec 2 07:18:29 UTC 2021



On 12/2/2021 8:42 AM, Yu, Lang wrote:
> [AMD Official Use Only]
> 
> 
> 
>> -----Original Message-----
>> From: Quan, Evan <Evan.Quan at amd.com>
>> Sent: Thursday, December 2, 2021 10:48 AM
>> To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; Christian König
>> <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Yu,
>>> Lang
>>> Sent: Wednesday, December 1, 2021 7:37 PM
>>> To: Koenig, Christian <Christian.Koenig at amd.com>; Christian König
>>> <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>> Subject: RE: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Wednesday, December 1, 2021 7:29 PM
>>>> To: Yu, Lang <Lang.Yu at amd.com>; Christian König
>>>> <ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>
>>>> Am 01.12.21 um 12:20 schrieb Yu, Lang:
>>>>> [AMD Official Use Only]
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>>>>> Sent: Wednesday, December 1, 2021 6:49 PM
>>>>>> To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
>>>>>> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>>>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>>>
>>>>>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>>>>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>>>>>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>>>>>
>>>>>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>>>>>>> To maintain system error state when SMU errors occurred, which
>>>>>>>>> will aid in debugging SMU firmware issues, add SMU debug option
>>> support.
>>>>>>>>>
>>>>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>>>>>> When enabled, it makes SMU errors fatal.
>>>>>>>>> It is disabled by default.
>>>>>>>>>
>>>>>>>>> == Command Guide ==
>>>>>>>>>
>>>>>>>>> 1, enable SMU debug option
>>>>>>>>>
>>>>>>>>>      # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>>>>
>>>>>>>>> 2, disable SMU debug option
>>>>>>>>>
>>>>>>>>>      # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>>>>
>>>>>>>>> v3:
>>>>>>>>>      - Use debugfs_create_bool().(Christian)
>>>>>>>>>      - Put variable into smu_context struct.
>>>>>>>>>      - Don't resend command when timeout.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>      - Resend command when timeout.(Lijo)
>>>>>>>>>      - Use debugfs file instead of module parameter.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>>>>>>>> Well the debugfs part looks really nice and clean now, but one
>>>>>>>> more comment below.
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>>>>>>      drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>>>>>>      drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>>>>>>      drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>>>>>>      4 files changed, 17 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct
>>>>>>>>> amdgpu_device
>>>>>>>> *adev)
>>>>>>>>>      	if (!debugfs_initialized())
>>>>>>>>>      		return 0;
>>>>>>>>>
>>>>>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>>>>>> +				  &adev->smu.smu_debug_mode);
>>>>>>>>> +
>>>>>>>>>      	ent = debugfs_create_file("amdgpu_preempt_ib", 0600,
>>> root,
>>>> adev,
>>>>>>>>>      				  &fops_ib_preempt);
>>>>>>>>>      	if (IS_ERR(ent)) {
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>>>>>      	struct smu_user_dpm_profile user_dpm_profile;
>>>>>>>>>
>>>>>>>>>      	struct stb_context stb_context;
>>>>>>>>> +	/*
>>>>>>>>> +	 * When enabled, it makes SMU errors fatal.
>>>>>>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>>>>>>> +	 */
>>>>>>>>> +	bool smu_debug_mode;
>>>>>>>>>      };
>>>>>>>>>
>>>>>>>>>      struct i2c_adapter;
>>>>>>>>> diff --git
>>> a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>>>>>>> smu_context *smu)
>>>>>>>>>      out:
>>>>>>>>>      	mutex_unlock(&smu->message_lock);
>>>>>>>>>
>>>>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>>>>>>> +
>>>>>>>>>      	return ret;
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> index 048ca1673863..9be005eb4241 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>>> @@ -349,15 +349,21 @@ int
>>>> smu_cmn_send_smc_msg_with_param(struct
>>>>>>>> smu_context *smu,
>>>>>>>>>      		__smu_cmn_reg_print_error(smu, reg, index, param,
>>>> msg);
>>>>>>>>>      		goto Out;
>>>>>>>>>      	}
>>>>>>>>> +
>>>>>>>>>      	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>>>>>      	reg = __smu_cmn_poll_stat(smu);
>>>>>>>>>      	res = __smu_cmn_reg2errno(smu, reg);
>>>>>>>>> -	if (res != 0)
>>>>>>>>> +	if (res != 0) {
>>>>>>>>>      		__smu_cmn_reg_print_error(smu, reg, index, param,
>>>> msg);
>>>>>>>>> +		goto Out;
>>>>>>>>> +	}
>>>>>>>>>      	if (read_arg)
>>>>>>>>>      		smu_cmn_read_arg(smu, read_arg);
>>>>>>>>>      Out:
>>>>>>>>>      	mutex_unlock(&smu->message_lock);
>>>>>>>>> +
>>>>>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>>>>> prevent further data corruption with that.
>>>>>>>>
>>>>>>>> Most of the time WARN_ON() is more appropriate, but I can't
>>>>>>>> fully judge here since I don't know the SMU code well enough.
>>>>>>> This is what SMU FW guys want. They want "user-visible
>>>>>>> (potentially
>>>>>>> fatal)
>>>>>> errors", then a hang.
>>>>>>> They want to keep system state since the error occurred.
>>>>>> Well that is rather problematic.
>>>>>>
>>>>>> First of all we need to really justify that, crashing the kernel
>>>>>> is not something easily done.
>>>>>>
>>>>>> Then this isn't really effective here. What happens is that you
>>>>>> crash the kernel thread of the currently executing process, but it
>>>>>> is perfectly possible that another thread still tries to send
>>>>>> messages to the SMU. You need to have the BUG_ON() before dropping
>>>>>> the lock to make sure that this really gets the driver stuck in
>>>>>> the current
>>> state.
>>>>> Thanks. I got it. I just thought it is a kenel panic.
>>>>> Could we use a panic() here?
>>>>
>>>> Potentially, but that might reboot the system automatically which is
>>>> probably not what you want either.
>>>>
>>>> How does the SMU firmware team gather the necessary information when
>>> a
>>>> problem occurs?
>>>
>>> As far as I know, they usually use a HDT to collect information.
>>> And they request a hang when error occurred in ticket.
>>> "Suggested error responses include pop-up windows (by x86 driver, if
>>> this is
>>> possible) or simply hanging after logging the error. "
>> [Quan, Evan] Maybe what they want is just a stable SMU state(like no more
>> message issuing after failure).
>> If that's true, I think you can just bail out on __smu_cmn_poll_stat() failure(in
>> smu_cmn_send_smc_msg_with_param() and
>> smu_cmn_send_msg_without_waiting()).
>> That will prevent further message issuing to SMU.
> 
> But it's difficult to distinguish normal timeout cases(see aldebaran_mode2_reset()).
> Sometimes it's reasonable to return a timeout. The user wants to use a longer timeout.
> 

Other than driver, other FWs also will interact with PMFW and this 
itself may not take care of that. The ideal thing is to stop driver jobs 
and further execution to preserve the fail state. BUG() may be the 
easiest if the expectation is to use external hardware to check the fail 
state.

Thanks,
Lijo

> Regards,
> Lang
> 
>>
>> BR
>> Evan
>>>
>>> Regards,
>>> Lang
>>>
>>>>
>>>> When they connect external hardware a BUG_ON() while holding the SMU
>>>> lock might work, but if they want to SSH into the system you should
>>>> probably rather set a flag that the hardware shouldn't be touched any
>>>> more by the driver and continue.
>>>>
>>>> Otherwise the box is most likely really unstable after this.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> Regards,
>>>>> Lang
>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Regards,
>>>>>>> Lang
>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>> +
>>>>>>>>>      	return res;
>>>>>>>>>      }
>>>>>>>>>


More information about the amd-gfx mailing list