[PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support

Mon Mar 28 10:11:20 UTC 2022

On 3/28/2022 3:25 PM, Paul Menzel wrote:
> Dear Mohammad,
> 
> 
> Am 28.03.22 um 11:49 schrieb Ziya, Mohammad zafar:
> 
>>> -----Original Message-----
>>> From: Paul Menzel <pmenzel at molgen.mpg.de>
>>> Sent: Monday, March 28, 2022 3:08 PM
> 
>>> Am 28.03.22 um 10:47 schrieb Ziya, Mohammad zafar:
>>>
>>> […]
>>>
>>>>> -----Original Message-----
>>>>> From: Paul Menzel <pmenzel at molgen.mpg.de>
>>>>> Sent: Monday, March 28, 2022 1:39 PM
>>>
>>>>> Am 28.03.22 um 10:00 schrieb Ziya, Mohammad zafar:
>>>>>
>>>>> […]
>>>>>
>>>>>>> From: Paul Menzel <pmenzel at molgen.mpg.de>
>>>>>>> Sent: Monday, March 28, 2022 1:22 PM
>>>
>>>>>>> Am 28.03.22 um 09:43 schrieb Zhou1, Tao:
>>>>>>>> -----Original Message-----
>>>>>>>> From: Ziya, Mohammad zafar <Mohammadzafar.Ziya at amd.com>
>>>>>>>> Sent: Monday, March 28, 2022 2:25 PM
>>>>>
>>>>> […]
>>>>>
>>>>>>>> +static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev,
>>>>>>>> +            uint32_t instance, uint32_t sub_block) {
>>>>>>>> +    uint32_t poison_stat = 0, reg_value = 0;
>>>>>>>> +
>>>>>>>> +    switch (sub_block) {
>>>>>>>> +    case AMDGPU_VCN_V2_6_VCPU_VCODEC:
>>>>>>>> +        reg_value = RREG32_SOC15(VCN, instance, mmUVD_RAS_VCPU_VCODEC_STATUS);
>>>>>>>> +        poison_stat = REG_GET_FIELD(reg_value, UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF);
>>>>>>>> +        break;
>>>>>>>> +    default:
>>>>>>>> +        break;
>>>>>>>> +    };
>>>>>>>> +
>>>>>>>> +    if (poison_stat)
>>>>>>>> +        dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",
>>>>>>>> +            instance, sub_block);
>>>>>>>
>>>>>>> What should a user do with that information? Faulty hardware, …?
>>>>>>
>>>>>> [Mohammad]: This message helps to identify the faulty hardware.
>>>>>> The hardware ID is logged along with the poison status, which helps
>>>>>> to identify the device among multiple devices installed on the system.
>>>>>
>>>>> Thank you for clarifying. If it’s indeed faulty hardware, should the
>>>>> log level be increased to an error? Keep in mind that normal
>>>>> ignorant users (like me) are reading the message, and it’d be great
>>>>> to guide them a little. I guess they do not know what “Poison” means.
>>>>> Maybe:
>>>>>
>>>>> A hardware corruption was found indicating the device might be faulty.
>>>>> (Poison detected in VCN%d, sub_block%d)\n
>>>>>
>>>>> (Keep in mind, I do not know anything about RAS.)
>>>>
>>>> [Mohammad]: It is an error condition, but this is just an
>>>> informational message, which could also have been omitted, because
>>>> VCN only consumed the poison and did not create it.
>>>
>>> Sorry, I have never seen these messages in `dmesg`, so could you
>>> please give an example log showing what the user would see?
>>>
>>
>> [Mohammad]: [  231.181316] amdgpu 0000:8a:00.0: amdgpu: Poison detected in VCN0, sub_block0
>>
>> Sample message from amdgpu: "[  237.013029] amdgpu 0000:8a:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available"
> 
> Hmm, that is six seconds later, so if Linux logs other messages in
> between, I have no idea whether the connection will be made.
> 
> Both messages read like debug messages, with normal users not having a
> clue what to do. Can that be improved by rewording them?
> 

Hi Paul,

In general, when this is detected, amdgpu will take a subsequent
recovery action. The message above mainly serves to identify why the
recovery action happened. The recovery steps themselves are still a
work in progress.

Such an event is expected to be rare, and a general user does not need
to do anything on seeing this message. If something really unexpected
does happen in such a rare case, this message in dmesg helps to identify
what happened and whether the driver took proper action.
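
For readers unfamiliar with the register helpers, here is a rough
userspace sketch of what the quoted function does: mask and shift a
status field out of a register value, and emit the message only when
the field is set. It is an illustration under stated assumptions, not
driver code. The demo_* names and the bit position of the poison field
are hypothetical; the real mask and shift come from the generated
register headers, which are not shown in this thread.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical field layout, chosen only for this illustration. */
#define DEMO_POISONED_PF_SHIFT 31
#define DEMO_POISONED_PF_MASK  (1u << DEMO_POISONED_PF_SHIFT)

/* Extract a field from a register value: mask first, then shift down. */
static uint32_t demo_get_field(uint32_t reg_value, uint32_t mask,
                               unsigned int shift)
{
    return (reg_value & mask) >> shift;
}

/*
 * Format a message shaped like the sample log line in this thread.
 * Returns 0 (and leaves buf untouched) when the poison field is clear,
 * otherwise the snprintf() result.
 */
static int demo_format_poison_msg(char *buf, size_t len, const char *dev_id,
                                  uint32_t reg_value, uint32_t instance,
                                  uint32_t sub_block)
{
    if (!demo_get_field(reg_value, DEMO_POISONED_PF_MASK,
                        DEMO_POISONED_PF_SHIFT))
        return 0;

    return snprintf(buf, len,
                    "amdgpu %s: amdgpu: Poison detected in VCN%u, sub_block%u",
                    dev_id, (unsigned int)instance, (unsigned int)sub_block);
}
```

Calling demo_format_poison_msg() with a register value that has the
assumed poison bit set fills the buffer with a line of the same shape as
the sample log quoted above. With the bit clear it writes nothing, which
is why the message only shows up in the rare poison case.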

Thanks,
Lijo

> 
> Kind regards,
> 
> Paul