[PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

Tue Feb 14 06:55:51 UTC 2023

On 2/14/2023 11:42 AM, Zhou1, Tao wrote:
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Tuesday, February 14, 2023 12:55 PM
>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Zhang,
>> Hawking <Hawking.Zhang at amd.com>; Yang, Stanley
>> <Stanley.Yang at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Li, Candice
>> <Candice.Li at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
>> bad page
>>
>>
>>
>> On 2/14/2023 7:56 AM, Zhou1, Tao wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>>>> Sent: Monday, February 13, 2023 8:38 PM
>>>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org;
>>>> Zhang, Hawking <Hawking.Zhang at amd.com>; Yang, Stanley
>>>> <Stanley.Yang at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Li,
>>>> Candice <Candice.Li at amd.com>
>>>> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if
>>>> no new bad page
>>>>
>>>>
>>>>
>>>> On 2/10/2023 2:15 PM, Tao Zhou wrote:
>>>>> If a UMC bad page is reserved but not freed by an application, the
>>>>> application may trigger uncorrectable error repeatly by accessing the page.
>>>>>
>>>>
>>>> There is amdgpu_ras_check_bad_page which checks if address is already
>>>> part of an existing bad page. Can't that be used?
>>>>
>>>> Thanks,
>>>> Lijo
>>>
>>> [Tao] amdgpu_ras_check_bad_page is already called in
>> amdgpu_ras_add_bad_pages, this patch just makes use of the result of
>> amdgpu_ras_check_bad_page.
>>>
>>
>> In the patch, below two are called after error count is set to 0.
>> 	amdgpu_ras_save_bad_pages
>> 	amdgpu_dpm_send_hbm_bad_pages_num
>>
>> Instead of that, just check if it's an existing badpage which is repeatedly
>> accessed and proceed directly to the next step (reset if
>> specified)
>>
>> 	if (amdgpu_ras_check_bad_page(adev, address))
>> 		set error count to 0;
>> 		goto reset_logic;
>>
>> Thanks,
>> Lijo
> 
> [Tao] 1. amdgpu_ras_check_bad_page checks only one page, but one ue can generate 16 or more pages.

In this case, it makes sense to do as per the patch. Thanks for the 
explanation.

Thanks,
Lijo

> 2. if no new bad page is found, amdgpu_ras_save_bad_pages will do nothing and ras_num_recs won't increase.
> 3. gpu reset logic should follow the old way even ue count is set to 0.
> 
> This patch only set ue count to 0 if there is no new bad page, but other logic has no change.
> 
>>
>>>>
>>>>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
>>>>>     2 files changed, 13 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> index e85c4689ce2c..eafe01a24349 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>>>> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
>>>> amdgpu_device *adev,
>>>>>     {
>>>>>     	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>>>     	struct ras_err_handler_data *data;
>>>>> -	int ret = 0;
>>>>> +	int ret = 0, old_cnt;
>>>>>     	uint32_t i;
>>>>>
>>>>>     	if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6
>>>>> +2060,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>>>     	if (!data)
>>>>>     		goto out;
>>>>>
>>>>> +	old_cnt = data->count;
>>>>> +
>>>>>     	for (i = 0; i < pages; i++) {
>>>>>     		if (amdgpu_ras_check_bad_page_unlock(con,
>>>>>     			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
>>>> @@ -2079,6
>>>>> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device
>> *adev,
>>>>>     		data->count++;
>>>>>     		data->space_left--;
>>>>>     	}
>>>>> +
>>>>> +	/* all pages have been reserved before, no new bad page */
>>>>> +	if (old_cnt == data->count)
>>>>> +		ret = -EEXIST;
>>>>> +
>>>>>     out:
>>>>>     	mutex_unlock(&con->recovery_lock);
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> index 1c7fcb4f2380..772c431e4065 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>>>> @@ -145,8 +145,12 @@ static int
>> amdgpu_umc_do_page_retirement(struct
>>>>> amdgpu_device *adev,
>>>>>
>>>>>     		if ((amdgpu_bad_page_threshold != 0) &&
>>>>>     			err_data->err_addr_cnt) {
>>>>> -			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>>>>> +			ret = amdgpu_ras_add_bad_pages(adev, err_data-
>>>>> err_addr,
>>>>>     						err_data->err_addr_cnt);
>>>>> +			/* if no new bad page is found, no need to increase ue
>>>> count */
>>>>> +			if (ret == -EEXIST)
>>>>> +				err_data->ue_count = 0;
>>>>> +
>>>>>     			amdgpu_ras_save_bad_pages(adev);
>>>>>
>>>>>     			amdgpu_dpm_send_hbm_bad_pages_num(adev,
>>>>> con->eeprom_control.ras_num_recs);