[PATCH] drm/amd/pm: Fix a bug communicating with the SMU
Luben Tuikov
luben.tuikov at amd.com
Wed Jul 14 16:57:07 UTC 2021
On 2021-07-14 11:19 a.m., Alex Deucher wrote:
> On Mon, Jul 12, 2021 at 10:56 PM Lazar, Lijo <lijo.lazar at amd.com> wrote:
>>
>>
>> On 7/12/2021 9:00 PM, Luben Tuikov wrote:
>>> This fixes a bug which if we probe a non-existing
>>> I2C device, and the SMU returns 0xFF, from then on
>>> we can never communicate with the SMU, because the
>>> code before this patch reads and interprets 0xFF
>>> as a terminal error, and thus we never write 0
>>> into register 90 to clear the status (and
>>> subsequently send a new command to the SMU.)
>>>
>>> It is not an error that the SMU returns status
>>> 0xFF. This means that the SMU executed the last
>>> command successfully (execution status), but the
>>> command result is an error of some sort (execution
>>> result), depending on what the command was.
>>>
>>> When doing a status check of the SMU, before we
>>> send a new command, the only status which
>>> precludes us from sending a new command is 0--the
>>> SMU hasn't finished executing a previous command,
>>> and 0xFC--the SMU is busy.
>>>
>>> This bug was seen as the following line in the
>>> kernel log,
>>>
>>> amdgpu: Msg issuing pre-check failed(0xff) and SMU may be not in the right state!
>>>
>>> when subsequent SMU commands, not necessarily
>>> related to I2C, were sent to the SMU.
>>>
>>> This patch fixes this bug.
>>>
>>> Cc: Alex Deucher <Alexander.Deucher at amd.com>
>>> Cc: Evan Quan <evan.quan at amd.com>
>>> Fixes: fcb1fe9c9e0031 ("drm/amd/powerplay: pre-check the SMU state before issuing message")
>>> Signed-off-by: Luben Tuikov <luben.tuikov at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 196 +++++++++++++++++++------
>>> drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h | 3 +-
>>> 2 files changed, 152 insertions(+), 47 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> index c902fdf322c1be..775eb50a2e49a6 100644
>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>>> @@ -55,7 +55,7 @@
>>>
>>> #undef __SMU_DUMMY_MAP
>>> #define __SMU_DUMMY_MAP(type) #type
>>> -static const char* __smu_message_names[] = {
>>> +static const char * const __smu_message_names[] = {
>>> SMU_MESSAGE_TYPES
>>> };
>>>
>>> @@ -76,46 +76,161 @@ static void smu_cmn_read_arg(struct smu_context *smu,
>>> *arg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82);
>>> }
>>>
>>> -int smu_cmn_wait_for_response(struct smu_context *smu)
>>> +/**
>>> + * __smu_cmn_poll_stat -- poll for a status from the SMU
>>> + * smu: a pointer to SMU context
>>> + *
>>> + * Returns the status of the SMU, which could be,
>>> + * 0, the SMU is busy with your previous command;
>>> + * 1, execution status: success, execution result: success;
>>> + * 0xFF, execution status: success, execution result: failure;
>>> + * 0xFE, unknown command;
>>> + * 0xFD, valid command, but bad (command) prerequisites;
>>> + * 0xFC, the command was rejected as the SMU is busy;
>>> + * 0xFB, "SMC_Result_DebugDataDumpEnd".
>>> + */
>> These are the response codes defined in header (0xFB is somehow missing)
>> // SMU Response Codes:
>> #define PPSMC_Result_OK 0x1
>> #define PPSMC_Result_Failed 0xFF
>> #define PPSMC_Result_UnknownCmd 0xFE
>> #define PPSMC_Result_CmdRejectedPrereq 0xFD
>> #define PPSMC_Result_CmdRejectedBusy 0xFC
>>
>> It's better to use #defines for these, usually we follow a convention
>> like SMU_
> We could do a MAP_RESULT() macro like we do with the messages, etc. to
> make them per asic, but that may be overkill as I think these result
> codes have been the same across asics for a long time.
I think this would be best done in a subsequent/follow-up patch, since
it doesn't affect the behaviour of the system. I feel that such a change, while cosmetic,
would be somewhat involved and deserves its own patch and review, but for now,
we should get the system going.
I also think that ideally we'd want this to arrive from the SMU team perhaps,
so that we're impervious to such changes.
Regards,
Luben
>
> Alex
>
>> Ex:
>> #define SMU_RESP_RESULT_OK 0x1
>>
>>
>>> +static u32 __smu_cmn_poll_stat(struct smu_context *smu)
>>> {
>>> struct amdgpu_device *adev = smu->adev;
>>> - uint32_t cur_value, i, timeout = adev->usec_timeout * 20;
>>> + int timeout = adev->usec_timeout * 20;
>>> + u32 reg;
>>>
>>> - for (i = 0; i < timeout; i++) {
>>> - cur_value = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>> - if ((cur_value & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>> - return cur_value;
>>> + for ( ; timeout > 0; timeout--) {
>>> + reg = RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>> + if ((reg & MP1_C2PMSG_90__CONTENT_MASK) != 0)
>>> + break;
>>>
>>> udelay(1);
>>> }
>>>
>>> - /* timeout means wrong logic */
>>> - if (i == timeout)
>>> - return -ETIME;
>>> -
>>> - return RREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90);
>>> + return reg;
>>> }
>>>
>>> -int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
>>> - uint16_t msg, uint32_t param)
>>> +static void __smu_cmn_reg_print_error(struct smu_context *smu,
>>> + u32 reg_c2pmsg_90,
>> Instead of using reg/regname in function, it would be better to name it
>> as smu_cmn_resp/smu_resp or similar to make it clear that we are
>> decoding smu response.
>>
>>> + int msg_index,
>>> + u32 param,
>>> + enum smu_message_type msg)
>>> {
>>> struct amdgpu_device *adev = smu->adev;
>>> - int ret;
>>> + const char *message = smu_get_message_name(smu, msg);
>>>
>>> - ret = smu_cmn_wait_for_response(smu);
>>> - if (ret != 0x1) {
>>> - dev_err(adev->dev, "Msg issuing pre-check failed(0x%x) and "
>>> - "SMU may be not in the right state!\n", ret);
>>> - if (ret != -ETIME)
>>> - ret = -EIO;
>>> - return ret;
>>> + switch (reg_c2pmsg_90) {
>>> + case 0:
>>> + dev_err_ratelimited(adev->dev,
>>> + "SMU: I'm not done with your previous command!");
>>> + break;
>>> + case 1:
>>> + /* The SMU executed the command. It completed with a
>>> + * successful result.
>>> + */
>>> + break;
>>> + case 0xFF:
>>> + /* The SMU executed the command. It completed with a
>>> + * unsuccessful result.
>>> + */
>>> + break;
>>> + case 0xFE:
>>> + dev_err_ratelimited(adev->dev,
>>> + "SMU: unknown command: index:%d param:0x%08X message:%s",
>>> + msg_index, param, message);
>>> + break;
>>> + case 0xFD:
>>> + dev_err_ratelimited(adev->dev,
>>> + "SMU: valid command, bad prerequisites: index:%d param:0x%08X message:%s",
>>> + msg_index, param, message);
>>> + break;
>>> + case 0xFC:
>>> + dev_err_ratelimited(adev->dev,
>>> + "SMU: I'm very busy for your command: index:%d param:0x%08X message:%s",
>>> + msg_index, param, message);
>>> + break;
>>> + case 0xFB:
>>> + dev_err_ratelimited(adev->dev,
>>> + "SMU: I'm debugging!");
>>> + break;
>>> + default:
>>> + dev_err_ratelimited(adev->dev,
>>> + "SMU: response:0x%08X for index:%d param:0x%08X message:%s?",
>>> + reg_c2pmsg_90, msg_index, param, message);
>>> + break;
>>> + }
>>> +}
>>> +
>>> +static int __smu_cmn_reg2errno(struct smu_context *smu, u32 reg_c2pmsg_90)
>> Same comment on naming - resp2errno?
>>> +{
>>> + int res;
>>> +
>>> + switch (reg_c2pmsg_90) {
>>> + case 0:
>>> + res = -ETIME;
>>> + break;
>>> + case 1:
>>> + res = 0;
>>> + break;
>>> + case 0xFF:
>>> + res = -EIO;
>>> + break;
>>> + case 0xFE:
>>> + res = -EOPNOTSUPP;
>>> + break;
>>> + case 0xFD:
>>> + res = -EIO;
>>> + break;
>>> + case 0xFC:
>>> + res = -EBUSY;
>>> + break;
>>> + case 0xFB:
>>> + res = -EIO;
>>> + break;
>>> + default:
>>> + res = -EIO;
>>> + break;
>>> }
>>>
>>> + return res;
>>> +}
>>> +
>>> +static void __smu_cmn_send_msg(struct smu_context *smu,
>>> + u16 msg,
>>> + u32 param)
>>> +{
>>> + struct amdgpu_device *adev = smu->adev;
>>> +
>>> WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_90, 0);
>>> WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_82, param);
>>> WREG32_SOC15(MP1, 0, mmMP1_SMN_C2PMSG_66, msg);
>>> +}
>>>
>>> - return 0;
>>> +int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
>>> + uint16_t msg_index,
>>> + uint32_t param)
>>> +{
>>> + u32 reg;
>>> + int res;
>>> +
>>> + if (smu->adev->in_pci_err_recovery)
>>> + return 0;
>>> +
>>> + mutex_lock(&smu->message_lock);
>>> + reg = __smu_cmn_poll_stat(smu);
>>> + if (reg == 0 || reg == 0xFC) {
>> The problem with 0xFC check is it could be the response of a previous
>> message. It could mean that FW was busy when the prev message was sent,
>> not now.
>>
>> There is a default case (value not in any of the predefined error
>> codes), that should be considered here also. That happens sometimes and
>> usually that means FW is in undefined state.
>>
>>
>>> + res = __smu_cmn_reg2errno(smu, reg);
>>> + goto Out;
>> Label naming style, lower case?.
>>
>>> + }
>>> + __smu_cmn_send_msg(smu, msg_index, param);
>>> + res = 0;
>>> +Out:
>>> + mutex_unlock(&smu->message_lock);
>>> + return res;
>>> +}
>>> +
>>> +int smu_cmn_wait_for_response(struct smu_context *smu)
>>> +{
>>> + u32 reg;
>>> +
>>> + reg = __smu_cmn_poll_stat(smu);
>>> + return __smu_cmn_reg2errno(smu, reg);
>>> }
>>>
>>> int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>> @@ -123,8 +238,8 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>> uint32_t param,
>>> uint32_t *read_arg)
>>> {
>>> - struct amdgpu_device *adev = smu->adev;
>>> - int ret = 0, index = 0;
>>> + int res, index;
>>> + u32 reg;
>>>
>>> if (smu->adev->in_pci_err_recovery)
>>> return 0;
>>> @@ -136,31 +251,20 @@ int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>> return index == -EACCES ? 0 : index;
>>>
>>> mutex_lock(&smu->message_lock);
>>> - ret = smu_cmn_send_msg_without_waiting(smu, (uint16_t)index, param);
>>> - if (ret)
>>> - goto out;
>>> -
>>> - ret = smu_cmn_wait_for_response(smu);
>>> - if (ret != 0x1) {
>>> - if (ret == -ETIME) {
>>> - dev_err(adev->dev, "message: %15s (%d) \tparam: 0x%08x is timeout (no response)\n",
>>> - smu_get_message_name(smu, msg), index, param);
>>> - } else {
>>> - dev_err(adev->dev, "failed send message: %15s (%d) \tparam: 0x%08x response %#x\n",
>>> - smu_get_message_name(smu, msg), index, param,
>>> - ret);
>>> - ret = -EIO;
>>> - }
>>> - goto out;
>>> + reg = __smu_cmn_poll_stat(smu);
>>> + if (reg == 0 || reg == 0xFC) {
>> Same comments as for without_waiting case.
>>
>>> + res = __smu_cmn_reg2errno(smu, reg);
>>> + __smu_cmn_reg_print_error(smu, reg, index, param, msg);
>> This precheck fail print is missing in without_waiting message.
>>
>>> + goto Out; > }
>>> -
>>> + __smu_cmn_send_msg(smu, (uint16_t) index, param);
>>> + reg = __smu_cmn_poll_stat(smu);
>>> + res = __smu_cmn_reg2errno(smu, reg);
>> Using smu_cmn_wait_for_response instead of these two makes the intent
>> clearer - that we are waiting for the response.
>>
>> We need a print here as well if the message has failed.
>>
>> Thanks,
>> Lijo
>>
>>> if (read_arg)
>>> smu_cmn_read_arg(smu, read_arg);
>>> -
>>> - ret = 0; /* 0 as driver return value */
>>> -out:
>>> +Out:
>>> mutex_unlock(&smu->message_lock);
>>> - return ret;
>>> + return res;
>>> }
>>>
>>> int smu_cmn_send_smc_msg(struct smu_context *smu,
>>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
>>> index 9add5f16ff562a..16993daa2ae042 100644
>>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
>>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.h
>>> @@ -27,7 +27,8 @@
>>>
>>> #if defined(SWSMU_CODE_LAYER_L2) || defined(SWSMU_CODE_LAYER_L3) || defined(SWSMU_CODE_LAYER_L4)
>>> int smu_cmn_send_msg_without_waiting(struct smu_context *smu,
>>> - uint16_t msg, uint32_t param);
>>> + uint16_t msg_index,
>>> + uint32_t param);
>>> int smu_cmn_send_smc_msg_with_param(struct smu_context *smu,
>>> enum smu_message_type msg,
>>> uint32_t param,
>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7Cluben.tuikov%40amd.com%7C63c12227eaee4417d06c08d946dad4be%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637618727955855924%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Au4k6zam4oVtRNl2JA%2BNEQTGjiOLxZULDKQnYDQg9ho%3D&reserved=0
More information about the amd-gfx
mailing list