[PATCH] drm/amdgpu: add support to SMU debug option

Yu, Lang Lang.Yu at amd.com
Wed Dec 1 11:20:48 UTC 2021


[AMD Official Use Only]



>-----Original Message-----
>From: Christian König <ckoenig.leichtzumerken at gmail.com>
>Sent: Wednesday, December 1, 2021 6:49 PM
>To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
><Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
><Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 11:44 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>> To maintain system error state when SMU errors occurred, which will
>>>> aid in debugging SMU firmware issues, add SMU debug option support.
>>>>
>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>> When enabled, it makes SMU errors fatal.
>>>> It is disabled by default.
>>>>
>>>> == Command Guide ==
>>>>
>>>> 1, enable SMU debug option
>>>>
>>>>    # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>
>>>> 2, disable SMU debug option
>>>>
>>>>    # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>
>>>> v3:
>>>>    - Use debugfs_create_bool().(Christian)
>>>>    - Put variable into smu_context struct.
>>>>    - Don't resend command when timeout.
>>>>
>>>> v2:
>>>>    - Resend command when timeout.(Lijo)
>>>>    - Use debugfs file instead of module parameter.
>>>>
>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>>> Well the debugfs part looks really nice and clean now, but one more
>>> comment below.
>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c        | 3 +++
>>>>    drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h            | 5 +++++
>>>>    drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>    drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c             | 8 +++++++-
>>>>    4 files changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>> *adev)
>>>>    	if (!debugfs_initialized())
>>>>    		return 0;
>>>>
>>>> +	debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>> +				  &adev->smu.smu_debug_mode);
>>>> +
>>>>    	ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root, adev,
>>>>    				  &fops_ib_preempt);
>>>>    	if (IS_ERR(ent)) {
>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>    	struct smu_user_dpm_profile user_dpm_profile;
>>>>
>>>>    	struct stb_context stb_context;
>>>> +	/*
>>>> +	 * When enabled, it makes SMU errors fatal.
>>>> +	 * (0 = disabled (default), 1 = enabled)
>>>> +	 */
>>>> +	bool smu_debug_mode;
>>>>    };
>>>>
>>>>    struct i2c_adapter;
>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>> smu_context *smu)
>>>>    out:
>>>>    	mutex_unlock(&smu->message_lock);
>>>>
>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>> +
>>>>    	return ret;
>>>>    }
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> index 048ca1673863..9be005eb4241 100644
>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>> @@ -349,15 +349,21 @@ int smu_cmn_send_smc_msg_with_param(struct
>>> smu_context *smu,
>>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>>    		goto Out;
>>>>    	}
>>>> +
>>>>    	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>    	reg = __smu_cmn_poll_stat(smu);
>>>>    	res = __smu_cmn_reg2errno(smu, reg);
>>>> -	if (res != 0)
>>>> +	if (res != 0) {
>>>>    		__smu_cmn_reg_print_error(smu, reg, index, param, msg);
>>>> +		goto Out;
>>>> +	}
>>>>    	if (read_arg)
>>>>    		smu_cmn_read_arg(smu, read_arg);
>>>>    Out:
>>>>    	mutex_unlock(&smu->message_lock);
>>>> +
>>>> +	BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>> BUG_ON() really crashes the kernel and is only allowed if we prevent
>>> further data corruption with that.
>>>
>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>> judge here since I don't know the SMU code well enough.
>> This is what SMU FW guys want. They want "user-visible (potentially fatal)
>errors", then a hang.
>> They want to keep system state since the error occurred.
>
>Well that is rather problematic.
>
>First of all we need to really justify that, crashing the kernel is not something
>easily done.
>
>Then this isn't really effective here. What happens is that you crash the kernel
>thread of the currently executing process, but it is perfectly possible that another
>thread still tries to send messages to the SMU. You need to have the BUG_ON()
>before dropping the lock to make sure that this really gets the driver stuck in the
>current state.

Thanks. I got it. I just thought it is a kenel panic.
Could we use a panic() here? 

Regards,
Lang

>Regards,
>Christian.
>
>>
>> Regards,
>> Lang
>>
>>> Christian.
>>>
>>>> +
>>>>    	return res;
>>>>    }
>>>>


More information about the amd-gfx mailing list