[PATCH] drm/amdgpu: add support to SMU debug option
Yu, Lang
Lang.Yu at amd.com
Wed Dec 1 11:37:24 UTC 2021
[AMD Official Use Only]
>-----Original Message-----
>From: Koenig, Christian <Christian.Koenig at amd.com>
>Sent: Wednesday, December 1, 2021 7:29 PM
>To: Yu, Lang <Lang.Yu at amd.com>; Christian König
><ckoenig.leichtzumerken at gmail.com>; amd-gfx at lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
><Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>
>Am 01.12.21 um 12:20 schrieb Yu, Lang:
>> [AMD Official Use Only]
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>> Sent: Wednesday, December 1, 2021 6:49 PM
>>> To: Yu, Lang <Lang.Yu at amd.com>; Koenig, Christian
>>> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>
>>> Am 01.12.21 um 11:44 schrieb Yu, Lang:
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Sent: Wednesday, December 1, 2021 5:30 PM
>>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Lazar, Lijo
>>>>> <Lijo.Lazar at amd.com>; Huang, Ray <Ray.Huang at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: add support to SMU debug option
>>>>>
>>>>> Am 01.12.21 um 10:24 schrieb Lang Yu:
>>>>>> To maintain system error state when SMU errors occurred, which
>>>>>> will aid in debugging SMU firmware issues, add SMU debug option support.
>>>>>>
>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>>> When enabled, it makes SMU errors fatal.
>>>>>> It is disabled by default.
>>>>>>
>>>>>> == Command Guide ==
>>>>>>
>>>>>> 1, enable SMU debug option
>>>>>>
>>>>>> # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> 2, disable SMU debug option
>>>>>>
>>>>>> # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>
>>>>>> v3:
>>>>>> - Use debugfs_create_bool().(Christian)
>>>>>> - Put variable into smu_context struct.
>>>>>> - Don't resend command when timeout.
>>>>>>
>>>>>> v2:
>>>>>> - Resend command when timeout.(Lijo)
>>>>>> - Use debugfs file instead of module parameter.
>>>>>>
>>>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>>>>> Well the debugfs part looks really nice and clean now, but one more
>>>>> comment below.
>>>>>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 3 +++
>>>>>> drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h | 5 +++++
>>>>>> drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c | 2 ++
>>>>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 8 +++++++-
>>>>>> 4 files changed, 17 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> index 164d6a9e9fbb..86cd888c7822 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> @@ -1618,6 +1618,9 @@ int amdgpu_debugfs_init(struct amdgpu_device
>>>>> *adev)
>>>>>> if (!debugfs_initialized())
>>>>>> return 0;
>>>>>>
>>>>>> + debugfs_create_bool("amdgpu_smu_debug", 0600, root,
>>>>>> + &adev->smu.smu_debug_mode);
>>>>>> +
>>>>>> ent = debugfs_create_file("amdgpu_preempt_ib", 0600, root,
>adev,
>>>>>> &fops_ib_preempt);
>>>>>> if (IS_ERR(ent)) {
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> index f738f7dc20c9..50dbf5594a9d 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
>>>>>> @@ -569,6 +569,11 @@ struct smu_context
>>>>>> struct smu_user_dpm_profile user_dpm_profile;
>>>>>>
>>>>>> struct stb_context stb_context;
>>>>>> + /*
>>>>>> + * When enabled, it makes SMU errors fatal.
>>>>>> + * (0 = disabled (default), 1 = enabled)
>>>>>> + */
>>>>>> + bool smu_debug_mode;
>>>>>> };
>>>>>>
>>>>>> struct i2c_adapter;
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> index 6e781cee8bb6..d3797a2d6451 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/aldebaran_ppt.c
>>>>>> @@ -1919,6 +1919,8 @@ static int aldebaran_mode2_reset(struct
>>>>> smu_context *smu)
>>>>>> out:
>>>>>> mutex_unlock(&smu->message_lock);
>>>>>>
>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && ret);
>>>>>> +
>>>>>> return ret;
>>>>>> }
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> index 048ca1673863..9be005eb4241 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>> @@ -349,15 +349,21 @@ int
>smu_cmn_send_smc_msg_with_param(struct
>>>>> smu_context *smu,
>>>>>> __smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>> goto Out;
>>>>>> }
>>>>>> +
>>>>>> __smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>> reg = __smu_cmn_poll_stat(smu);
>>>>>> res = __smu_cmn_reg2errno(smu, reg);
>>>>>> - if (res != 0)
>>>>>> + if (res != 0) {
>>>>>> __smu_cmn_reg_print_error(smu, reg, index, param,
>msg);
>>>>>> + goto Out;
>>>>>> + }
>>>>>> if (read_arg)
>>>>>> smu_cmn_read_arg(smu, read_arg);
>>>>>> Out:
>>>>>> mutex_unlock(&smu->message_lock);
>>>>>> +
>>>>>> + BUG_ON(unlikely(smu->smu_debug_mode) && res);
>>>>> BUG_ON() really crashes the kernel and is only allowed if we
>>>>> prevent further data corruption with that.
>>>>>
>>>>> Most of the time WARN_ON() is more appropriate, but I can't fully
>>>>> judge here since I don't know the SMU code well enough.
>>>> This is what SMU FW guys want. They want "user-visible (potentially
>>>> fatal)
>>> errors", then a hang.
>>>> They want to keep system state since the error occurred.
>>> Well that is rather problematic.
>>>
>>> First of all we need to really justify that, crashing the kernel is
>>> not something easily done.
>>>
>>> Then this isn't really effective here. What happens is that you crash
>>> the kernel thread of the currently executing process, but it is
>>> perfectly possible that another thread still tries to send messages
>>> to the SMU. You need to have the BUG_ON() before dropping the lock to
>>> make sure that this really gets the driver stuck in the current state.
>> Thanks. I got it. I just thought it is a kenel panic.
>> Could we use a panic() here?
>
>Potentially, but that might reboot the system automatically which is probably not
>what you want either.
>
>How does the SMU firmware team gather the necessary information when a
>problem occurs?
As far as I know, they usually use a HDT to collect information.
And they request a hang when error occurred in ticket.
"Suggested error responses include pop-up windows (by x86 driver, if this is possible) or simply hanging after logging the error. "
Regards,
Lang
>
>When they connect external hardware a BUG_ON() while holding the SMU lock
>might work, but if they want to SSH into the system you should probably rather
>set a flag that the hardware shouldn't be touched any more by the driver and
>continue.
>
>Otherwise the box is most likely really unstable after this.
>
>Regards,
>Christian.
>
>>
>> Regards,
>> Lang
>>
>>> Regards,
>>> Christian.
>>>
>>>> Regards,
>>>> Lang
>>>>
>>>>> Christian.
>>>>>
>>>>>> +
>>>>>> return res;
>>>>>> }
>>>>>>
More information about the amd-gfx
mailing list