[PATCH] drm/amdgpu: add support to SMU debug option

Christian König christian.koenig at amd.com
Wed Dec 1 11:29:24 UTC 2021


Am 01.12.21 um 12:20 schrieb Yu, Lang:
> [AMD Official Use Only]
>
>> -----Original Message-----
>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> Sent: Wednesday, December 1, 2021 6:49 PM
>> To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
>> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>
>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>> [AMD Official Use Only]
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>
>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>>> To maintain system error state when SMU errors occurred, which will
>>>>> aid in debugging SMU firmware issues, add SMU debug option support.
>>>>>
>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>> When enabled, it makes SMU errors fatal.
>>>>> It is disabled by default.
>>>>>
>>>>> == Command Guide ==
>>>>>
>>>>> 1, enable SMU debug option
>>>>>
>>>>>     # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>
>>>>> 2, disable SMU debug option
>>>>>
>>>>>     # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>
>>>>> v3:
>>>>>     - Use debugfs_create_bool().(Christian)
>>>>>     - Put variable into smu_context struct.
>>>>>     - Don't resend command when timeout.
>>>>>
>>>>> v2:
>>>>>     - Resend command when timeout.(Lijo)
>>>>>     - Use debugfs file instead of module parameter.
>>>>>
>>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>>>> Well the debugfs part looks really nice and clean now, but one more
>>>> comment below.
>>>>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>>     drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>>     drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>>     4 files changed, 17 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>>> *adev)
>>>>>     	if (!debugfs_initialized())
>>>>>     		return 0;
>>>>>
>>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>> +				  &adev->smu.smu_debug_mode);
>>>>> +
>>>>>     	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>>>     				  &fops_ib_preempt);
>>>>>     	if (IS_ERR(ent)) {
>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>     	struct smu_user_dpm_profile user_dpm_profile;
>>>>>
>>>>>     	struct stb_context stb_context;
>>>>> +	/*
>>>>> +	 * When enabled, it makes SMU errors fatal.
>>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>>> +	 */
>>>>> +	bool smu_debug_mode;
>>>>>     };
>>>>>
>>>>>     struct i2c_adapter;
>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>>> smu_context *smu)
>>>>>     out:
>>>>>     	mutex_unlock(&smu->message_lock);
>>>>>
>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>>> +
>>>>>     	return ret;
>>>>>     }
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> index 048ca1673863..9be005eb4241 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>>>> smu_context *smu,
>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>>     		goto Out;
>>>>>     	}
>>>>> +
>>>>>     	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>     	reg = __smu_cmn_poll_stat(smu);
>>>>>     	res = __smu_cmn_reg2errno(smu, reg);
>>>>> -	if (res != 0)
>>>>> +	if (res != 0) {
>>>>>     		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>> +		goto Out;
>>>>> +	}
>>>>>     	if (read_arg)
>>>>>     		smu_cmn_read_arg(smu, read_arg);
>>>>>     Out:
>>>>>     	mutex_unlock(&smu->message_lock);
>>>>> +
>>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>> BUG_ON() really crashes the kernel and is only allowed if we prevent
>>>> further data corruption with that.
>>>>
>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>> judge here since I don't know the SMU code well enough.
>>> This is what SMU FW guys want. They want "user-visible (potentially fatal)
>> errors", then a hang.
>>> They want to keep system state since the error occurred.
>> Well that is rather problematic.
>>
>> First of all we need to really justify that, crashing the kernel is not something
>> easily done.
>>
>> Then this isn't really effective here. What happens is that you crash the kernel
>> thread of the currently executing process, but it is perfectly possible that another
>> thread still tries to send messages to the SMU. You need to have the BUG_ON()
>> before dropping the lock to make sure that this really gets the driver stuck in the
>> current state.
> Thanks. I got it. I just thought it is a kenel panic.
> Could we use a panic() here?

Potentially, but that might reboot the system automatically which is 
probably not what you want either.

How does the SMU firmware team gather the necessary information when a 
problem occurs?

When they connect external hardware a BUG_ON() while holding the SMU 
lock might work, but if they want to SSH into the system you should 
probably rather set a flag that the hardware shouldn't be touched any 
more by the driver and continue.

Otherwise the box is most likely really unstable after this.

Regards,
Christian.

>
> Regards,
> Lang
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Lang
>>>
>>>> Christian.
>>>>
>>>>> +
>>>>>     	return res;
>>>>>     }
>>>>>



More information about the amd-gfx mailing list