[PATCH] drm/amdgpu: add SMU debug option support

Lazar, Lijo lijo.lazar at amd.com
Wed Dec 1 08:46:16 UTC 2021



On 12/1/2021 1:48 PM, Yu, Lang wrote:
> [AMD Official Use Only]
> 
> 
> 
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Yu, Lang
>> Sent: Wednesday, December 1, 2021 3:58 PM
>> To: Lazar, Lijo <Lijo.Lazar at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>> <Ray.Huang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
>> Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support
>>
>> [AMD Official Use Only]
>>
>>
>>
>>> -----Original Message-----
>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>> Sent: Wednesday, December 1, 2021 3:28 PM
>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>> <Ray.Huang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>>>
>>>
>>>
>>> On 12/1/2021 12:37 PM, Yu, Lang wrote:
>>>> [AMD Official Use Only]
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>> Sent: Wednesday, December 1, 2021 2:56 PM
>>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>>>> <Ray.Huang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
>>>>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>>>>>
>>>>>
>>>>>
>>>>> On 12/1/2021 11:57 AM, Yu, Lang wrote:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> Hi Lijo,
>>>>>>
>>>>>> Thanks for your comments.
>>>>>>
>>>>>>    From my understanding, that just increases the timeout threshold
>>>>>> and could hide some potential issues which should be exposed and solved.
>>>>>>
>>>>>> If current timeout threshold is not enough for some corner cases,
>>>>>> (1) Do we consider to increase the threshold to cover these cases?
>>>>>> (2) Or do we just expose them and request SMU FW to optimize them?
>>>>>>
>>>>>> I think it doesn't make much sense to increase the threshold in debug mode.
>>>>>> How do you think? Thanks!
>>>>>
>>>>> In normal cases, 2secs would be more than enough. If we hang
>>>>> immediately, then check the FW registers later, the response would
>>>>> have come. I thought we just need to note those cases and not to
>>>>> fail everytime. Just to mark as a red flag in the log to tell us
>>>>> that FW is unexpectedly busy processing something else when the message is
>> sent.
>>>>>
>>>>> There are some issues related to S0ix where we see the FW comes back
>>>>> with a response with an increased timeout under certain conditions.
>>>>
>>>> If these issues still exists, could we just blacklist the tests that
>>>> triggered them before solve them? Or we just increase the threshold
>>>> to cover
>>> all the cases?
>>>>
>>>
>>> Actually, the timeout is message specific - like i2c transfer from
>>> EEPROM could take longer time.
>>>
>>> I am not sure if we should have more than 2s as timeout. Whenever this
>>> kind of issue happens, FW team check registers (then it will have a
>>> proper value) and say they don't see anything abnormal :) Usually,
>>> those are just signs of crack and it eventually breaks.
>>>
>>> Option is just fail immediately (then again not sure useful it will be
>>> if the issue is this sort of thing) or wait to see how far it goes with
>>> an added timeout before it fails eventually.
>>
>> Are smu_cmn_wait_for_response()/smu_cmn_send_msg_without_waiting()
>> designed for long timeout cases? Is it fine that we don't fail here in the event of
>> timeout?
> 
> Or we can add a timeout parameter into smu_cmn_send_smc_msg_with_param()
> to specify the timeout you want for specific message.
> I think this may be another story. Thanks!
>

Yes, that will be a different patch. For now, skip the extended timeout. 
Every timeout will trigger a debug alarm and let it be that way for 
debug mode. I think you can skip the retry also (originally I meant this 
by that comment - retry again for response reg check).

Thanks,
Lijo

> Thanks,
> Lang
>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>> Regards,
>>>> Lang
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Lang
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>>>>> Sent: Wednesday, December 1, 2021 1:44 PM
>>>>>>> To: Lazar, Lijo <Lijo.Lazar at amd.com>; Yu, Lang <Lang.Yu at amd.com>;
>>>>>>> amd- gfx at lists.freedesktop.org
>>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>>>>>> <Ray.Huang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>> Subject: RE: [PATCH] drm/amdgpu: add SMU debug option support
>>>>>>>
>>>>>>> Just realized that the patch I pasted won't work. Outer loop exit
>>>>>>> needs to be like this.
>>>>>>> 	(reg & MP1_C2PMSG_90__CONTENT_MASK) != 0 && extended_wait-- >=
>>>>>>> 0
>>>>>>>
>>>>>>> Anyway, that patch is only there to communicate what I really
>>>>>>> meant in the earlier comment.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of
>>>>>>> Lazar, Lijo
>>>>>>> Sent: Wednesday, December 1, 2021 10:44 AM
>>>>>>> To: Yu, Lang <Lang.Yu at amd.com>; amd-gfx at lists.freedesktop.org
>>>>>>> Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Huang, Ray
>>>>>>> <Ray.Huang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>
>>>>>>> Subject: Re: [PATCH] drm/amdgpu: add SMU debug option support
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/30/2021 10:47 AM, Lang Yu wrote:
>>>>>>>> To maintain system error state when SMU errors occurred, which
>>>>>>>> will aid in debugging SMU firmware issues, add SMU debug option
>> support.
>>>>>>>>
>>>>>>>> It can be enabled or disabled via amdgpu_smu_debug debugfs file.
>>>>>>>> When enabled, it makes SMU errors fatal.
>>>>>>>> It is disabled by default.
>>>>>>>>
>>>>>>>> == Command Guide ==
>>>>>>>>
>>>>>>>> 1, enable SMU debug option
>>>>>>>>
>>>>>>>>      # echo 1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>>>
>>>>>>>> 2, disable SMU debug option
>>>>>>>>
>>>>>>>>      # echo 0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug
>>>>>>>>
>>>>>>>> v2:
>>>>>>>>      - Resend command when timeout.(Lijo)
>>>>>>>>      - Use debugfs file instead of module parameter.
>>>>>>>>
>>>>>>>> Signed-off-by: Lang Yu <lang.yu at amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 32
>>>>> +++++++++++++++++
>>>>>>>>      drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c      | 39
>>>>> +++++++++++++++++++-
>>>>>>> -
>>>>>>>>      2 files changed, 69 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> index 164d6a9e9fbb..f9412de86599 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> @@ -39,6 +39,8 @@
>>>>>>>>
>>>>>>>>      #if defined(CONFIG_DEBUG_FS)
>>>>>>>>
>>>>>>>> +extern int amdgpu_smu_debug;
>>>>>>>> +
>>>>>>>>      /**
>>>>>>>>       * amdgpu_debugfs_process_reg_op - Handle MMIO register
>>> reads/writes
>>>>>>>>       *
>>>>>>>> @@ -1152,6 +1154,8 @@ static ssize_t
>>>>>>>> amdgpu_debugfs_gfxoff_read(struct
>>>>>>> file *f, char __user *buf,
>>>>>>>>      	return result;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +
>>>>>>>> +
>>>>>>>>      static const struct file_operations amdgpu_debugfs_regs2_fops = {
>>>>>>>>      	.owner = THIS_MODULE,
>>>>>>>>      	.unlocked_ioctl = amdgpu_debugfs_regs2_ioctl, @@ -1609,6
>>>>>>>> +1613,26 @@ DEFINE_DEBUGFS_ATTRIBUTE(fops_ib_preempt, NULL,
>>>>>>>>      DEFINE_DEBUGFS_ATTRIBUTE(fops_sclk_set, NULL,
>>>>>>>>      			amdgpu_debugfs_sclk_set, "%llu\n");
>>>>>>>>
>>>>>>>> +static int amdgpu_debugfs_smu_debug_get(void *data, u64 *val) {
>>>>>>>> +	*val = amdgpu_smu_debug;
>>>>>>>> +	return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int amdgpu_debugfs_smu_debug_set(void *data, u64 val) {
>>>>>>>> +	if (val != 0 && val != 1)
>>>>>>>> +		return -EINVAL;
>>>>>>>> +
>>>>>>>> +	amdgpu_smu_debug = val;
>>>>>>>> +	return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +DEFINE_DEBUGFS_ATTRIBUTE(fops_smu_debug,
>>>>>>>> +			 amdgpu_debugfs_smu_debug_get,
>>>>>>>> +			 amdgpu_debugfs_smu_debug_set,
>>>>>>>> +			 "%llu\n");
>>>>>>>> +
>>>>>>>>      int amdgpu_debugfs_init(struct amdgpu_device *adev)
>>>>>>>>      {
>>>>>>>>      	struct dentry *root =
>>>>>>>> adev_to_drm(adev)->primary->debugfs_root;
>>>>>>>> @@ -1632,6 +1656,14 @@ int amdgpu_debugfs_init(struct
>>>>>>>> amdgpu_device
>>>>>>> *adev)
>>>>>>>>      		return PTR_ERR(ent);
>>>>>>>>      	}
>>>>>>>>
>>>>>>>> +	ent = debugfs_create_file("amdgpu_smu_debug", 0600, root,
>> adev,
>>>>>>>> +				  &fops_smu_debug);
>>>>>>>> +	if (IS_ERR(ent)) {
>>>>>>>> +		DRM_ERROR("unable to create amdgpu_smu_debug
>> debugsfs
>>>>>>> file\n");
>>>>>>>> +		return PTR_ERR(ent);
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +
>>>>>>>>      	/* Register debugfs entries for amdgpu_ttm */
>>>>>>>>      	amdgpu_ttm_debugfs_init(adev);
>>>>>>>>      	amdgpu_debugfs_pm_init(adev); diff --git
>>>>>>>> a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>> index 048ca1673863..b3969d7933d3 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>>>>>>> @@ -55,6 +55,14 @@
>>>>>>>>
>>>>>>>>      #undef __SMU_DUMMY_MAP
>>>>>>>>      #define __SMU_DUMMY_MAP(type)	#type
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * Used to enable SMU debug option. When enabled, it makes SMU
>>>>>>>> +errors
>>>>>>> fatal.
>>>>>>>> + * This will aid in debugging SMU firmware issues.
>>>>>>>> + * (0 = disabled (default), 1 = enabled)  */ int
>>>>>>>> + amdgpu_smu_debug;
>>>>>>>> +
>>>>>>>>      static const char * const __smu_message_names[] = {
>>>>>>>>      	SMU_MESSAGE_TYPES
>>>>>>>>      };
>>>>>>>> @@ -272,6 +280,11 @@ int smu_cmn_send_msg_without_waiting(struct
>>>>>>> smu_context *smu,
>>>>>>>>      	__smu_cmn_send_msg(smu, msg_index, param);
>>>>>>>>      	res = 0;
>>>>>>>>      Out:
>>>>>>>> +	if (unlikely(amdgpu_smu_debug == 1) && res) {
>>>>>>>> +		mutex_unlock(&smu->message_lock);
>>>>>>>> +		BUG();
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>      	return res;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> @@ -288,9 +301,17 @@ int smu_cmn_send_msg_without_waiting(struct
>>>>>>> smu_context *smu,
>>>>>>>>      int smu_cmn_wait_for_response(struct smu_context *smu)
>>>>>>>>      {
>>>>>>>>      	u32 reg;
>>>>>>>> +	int res;
>>>>>>>>
>>>>>>>>      	reg = __smu_cmn_poll_stat(smu);
>>>>>>>> -	return __smu_cmn_reg2errno(smu, reg);
>>>>>>>> +	res = __smu_cmn_reg2errno(smu, reg);
>>>>>>>> +
>>>>>>>> +	if (unlikely(amdgpu_smu_debug == 1) && res) {
>>>>>>>> +		mutex_unlock(&smu->message_lock);
>>>>>>>> +		BUG();
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	return res;
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      /**
>>>>>>>> @@ -328,6 +349,7 @@ int smu_cmn_send_smc_msg_with_param(struct
>>>>>>> smu_context *smu,
>>>>>>>>      				    uint32_t param,
>>>>>>>>      				    uint32_t *read_arg)
>>>>>>>>      {
>>>>>>>> +	int retry_count = 0;
>>>>>>>>      	int res, index;
>>>>>>>>      	u32 reg;
>>>>>>>>
>>>>>>>> @@ -349,15 +371,28 @@ int
>>> smu_cmn_send_smc_msg_with_param(struct
>>>>>>> smu_context *smu,
>>>>>>>>      		__smu_cmn_reg_print_error(smu, reg, index, param,
>>> msg);
>>>>>>>>      		goto Out;
>>>>>>>>      	}
>>>>>>>> +retry:
>>>>>>>>      	__smu_cmn_send_msg(smu, (uint16_t) index, param);
>>>>>>>>      	reg = __smu_cmn_poll_stat(smu);
>>>>>>>>      	res = __smu_cmn_reg2errno(smu, reg);
>>>>>>>> -	if (res != 0)
>>>>>>>> +	if (res != 0) {
>>>>>>>>      		__smu_cmn_reg_print_error(smu, reg, index, param,
>>> msg);
>>>>>>>> +		if ((res == -ETIME) && (retry_count++ < 1)) {
>>>>>>>> +			usleep_range(500, 1000);
>>>>>>>> +			dev_err(smu->adev->dev,
>>>>>>>> +				"SMU: resend command: index:%d
>>>>>>> param:0x%08X message:%s",
>>>>>>>> +				index, param,
>> smu_get_message_name(smu,
>>>>>>> msg));
>>>>>>>> +			goto retry;
>>>>>>>> +		}
>>>>>>>> +		goto Out;
>>>>>>>> +	}
>>>>>>>
>>>>>>> Sorry, what I meant is to have an extended wait time in debug mode.
>>>>>>> Something like below, not a 'full retry' as in sending the message again.
>>>>>>>
>>>>>>>
>>>>>>> +#define MAX_DBG_WAIT_CNT 3
>>>>>>> +
>>>>>>>     /**
>>>>>>>      * __smu_cmn_poll_stat -- poll for a status from the SMU
>>>>>>>      * smu: a pointer to SMU context @@ -115,17 +117,24 @@ static
>>>>>>> void smu_cmn_read_arg(struct smu_context *smu,
>>>>>>>     static u32 __smu_cmn_poll_stat(struct smu_context *smu)
>>>>>>>     {
>>>>>>>            struct amdgpu_device *adev = smu->adev;
>>>>>>> -       int timeout = adev->usec_timeout * 20;
>>>>>>> +       int timeout;
>>>>>>>            u32 reg;
>>>>>>> +       int extended_wait = smu_debug_mode ? MAX_DBG_WAIT_CNT : 0;
>>>>>>>
>>>>>>> -       for ( ; timeout > 0; timeout--) {
>>>>>>> -               reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>>>>>> -               if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>>>>>> -                       break;
>>>>>>> +       do {
>>>>>>> +               timeout = adev->usec_timeout * 20;
>>>>>>> +               for (; timeout > 0; timeout--) {
>>>>>>> +                       reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>>>>>> +                       if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>>>>>> +                               break;
>>>>>>>
>>>>>>> -               udelay(1);
>>>>>>> -       }
>>>>>>> +                       udelay(1);
>>>>>>> +               }
>>>>>>> +       } while (extended_wait-- >= 0);
>>>>>>>
>>>>>>> +       if (extended_wait != MAX_DBG_WAIT_CNT && reg !=
>>> SMU_RESP_NONE)
>>>>>>> +               dev_err(adev->dev,
>>>>>>> +                       "SMU: Unexpected extended wait for
>>>>>>> + response");
>>>>>>>            return reg;
>>>>>>>     }
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>>      	if (read_arg)
>>>>>>>>      		smu_cmn_read_arg(smu, read_arg);
>>>>>>>>      Out:
>>>>>>>>      	mutex_unlock(&smu->message_lock);
>>>>>>>> +
>>>>>>>> +	BUG_ON(unlikely(amdgpu_smu_debug == 1) && res);
>>>>>>>> +
>>>>>>>>      	return res;
>>>>>>>>      }
>>>>>>>>
>>>>>>>>


More information about the amd-gfx mailing list