[PATCH v4 5/6] drm/amdgpu/vcn: VCN ras error query support

Paul Menzel pmenzel at molgen.mpg.de
Mon Mar 28 08:08:45 UTC 2022


Dear Mohammad,


Am 28.03.22 um 10:00 schrieb Ziya, Mohammad zafar:

[…]

>> From: Paul Menzel <pmenzel at molgen.mpg.de>
>> Sent: Monday, March 28, 2022 1:22 PM

[…]

>> Tao, it’s hard to find your reply in the quote, as the message is not quoted
>> correctly (> prepended). Is it possible to use a different email client?
>>
>>
>> Am 28.03.22 um 09:43 schrieb Zhou1, Tao:
>>> -----Original Message-----
>>> From: Ziya, Mohammad zafar <Mohammadzafar.Ziya at amd.com>
>>> Sent: Monday, March 28, 2022 2:25 PM

[…]

>>> +static uint32_t vcn_v2_6_query_poison_by_instance(struct amdgpu_device *adev,
>>> +			uint32_t instance, uint32_t sub_block) {
>>> +	uint32_t poison_stat = 0, reg_value = 0;
>>> +
>>> +	switch (sub_block) {
>>> +	case AMDGPU_VCN_V2_6_VCPU_VCODEC:
>>> +		reg_value = RREG32_SOC15(VCN, instance, mmUVD_RAS_VCPU_VCODEC_STATUS);
>>> +		poison_stat = REG_GET_FIELD(reg_value, UVD_RAS_VCPU_VCODEC_STATUS, POISONED_PF);
>>> +		break;
>>> +	default:
>>> +		break;
>>> +	};
>>> +
>>> +	if (poison_stat)
>>> +		dev_info(adev->dev, "Poison detected in VCN%d, sub_block%d\n",
>>> +			instance, sub_block);
>>
>> What should a user do with that information? Faulty hardware, …?
> 
> [Mohammad]: This message helps to identify the faulty hardware. The
> hardware ID is logged along with the poison status, which helps to tell
> the affected device apart when several are installed in the system.

Thank you for clarifying. If it’s indeed faulty hardware, should the log 
level be raised to error? Keep in mind that ordinary users without RAS 
knowledge (like me) read these messages, and it’d be great to guide them 
a little. I guess they do not know what “poison” means. Maybe:

A hardware corruption was found indicating the device might be faulty. 
(Poison detected in VCN%d, sub_block%d)\n

(Keep in mind, I do not know anything about RAS.)
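
To illustrate the suggestion, here is a minimal userspace sketch of how such
a message could be built. The helper name `format_poison_msg()` and the
buffer handling are my assumptions for illustration and are not part of the
patch; in the driver this would simply be the format string passed to
`dev_err()` (or `dev_warn()`, if error turns out to be too strong).

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patch: formats the proposed
 * user-facing text into buf. In the driver the same format string
 * would be handed to dev_err() instead of snprintf().
 * Returns the number of characters snprintf() would have written.
 */
static int format_poison_msg(char *buf, size_t len,
			     uint32_t instance, uint32_t sub_block)
{
	return snprintf(buf, len,
			"A hardware corruption was found indicating the device might be faulty. "
			"(Poison detected in VCN%" PRIu32 ", sub_block%" PRIu32 ")\n",
			instance, sub_block);
}
```

The explanation comes first, and the technical detail stays in parentheses,
so the line is still easy to grep for.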

>>> +
>>> +	return poison_stat;
>>> +}
>>> +
>>> +static bool vcn_v2_6_query_poison_status(struct amdgpu_device *adev) {
>>> +	uint32_t inst, sub;
>>> +	uint32_t poison_stat = 0;
>>> +
>>> +	for (inst = 0; inst < adev->vcn.num_vcn_inst; inst++)
>>> +		for (sub = 0; sub < AMDGPU_VCN_V2_6_MAX_SUB_BLOCK; sub++)
>>> +			poison_stat +=
>>> +			vcn_v2_6_query_poison_by_instance(adev, inst, sub);
>>> +
>>> +	return poison_stat ? true : false;
>>>
>>> [Tao] Just want to confirm the logic: if the POISONED_PF of one instance is 1
>>> and another is 0, poison_stat is true, correct?
>>> Can we add a print message for this case? Same question for JPEG.
> 
> [Mohammad]: The message is printed in the function block ahead of this function.
> 
>> Is the `dev_info` message in `vcn_v2_6_query_poison_by_instance()` doing
>> what you want?
>>
>> Also, instead of `poison_stat ? true : false;` a common pattern is
>> `!!poison_stat` I believe.

[…]

> [Mohammad]: Noted the change. Will change it to return !!poison_stat ? true : false;

No, just

     return !!poison_stat;
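
For reference, a minimal userspace sketch of the aggregation with that return
statement. The status array stands in for the register reads and is made up
for illustration; `NUM_INST`, `MAX_SUB_BLOCK` and `query_poison_status()` are
hypothetical names, not the driver’s. It also covers Tao’s question: one
poisoned instance among clean ones is enough to make the result true.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_INST      2	/* assumed instance count, illustration only */
#define MAX_SUB_BLOCK 1	/* assumed sub-block count, illustration only */

/*
 * Stand-in for the register reads: status[inst][sub] holds the
 * POISONED_PF bit that the driver would read from hardware.
 */
static bool query_poison_status(uint32_t status[NUM_INST][MAX_SUB_BLOCK])
{
	uint32_t inst, sub;
	uint32_t poison_stat = 0;

	for (inst = 0; inst < NUM_INST; inst++)
		for (sub = 0; sub < MAX_SUB_BLOCK; sub++)
			poison_stat += status[inst][sub];

	/* !! maps any non-zero value to 1 and zero to 0. */
	return !!poison_stat;
}
```

With the status `{ {1}, {0} }` this returns true, and with all zeros it
returns false. As the return type is `bool`, a plain `return poison_stat;`
would convert the same way in C; the `!!` just makes the intent explicit.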

[…]


Kind regards,

Paul

