[PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

Tue Feb 14 04:55:05 UTC 2023


On 2/14/2023 7:56 AM, Zhou1, Tao wrote:
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Monday, February 13, 2023 8:38 PM
>> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Zhang,
>> Hawking <Hawking.Zhang at amd.com>; Yang, Stanley
>> <Stanley.Yang at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Li, Candice
>> <Candice.Li at amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
>> bad page
>>
>>
>>
>> On 2/10/2023 2:15 PM, Tao Zhou wrote:
>>> If a UMC bad page is reserved but not freed by an application, the
>>> application may trigger uncorrectable error repeatly by accessing the page.
>>>
>>
>> There is amdgpu_ras_check_bad_page which checks if address is already part of
>> an existing bad page. Can't that be used?
>>
>> Thanks,
>> Lijo
> 
> [Tao] amdgpu_ras_check_bad_page is already called in amdgpu_ras_add_bad_pages, this patch just makes use of the result of amdgpu_ras_check_bad_page.
> 

In the patch, below two are called after error count is set to 0.
	amdgpu_ras_save_bad_pages
	amdgpu_dpm_send_hbm_bad_pages_num

Instead of that, just check if it's an existing badpage which is 
repeatedly accessed and proceed directly to the next step (reset if 
specified)

	if (amdgpu_ras_check_bad_page(adev, address))
		set error count to 0;
		goto reset_logic;

Thanks,
Lijo

>>
>>> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
>>>    2 files changed, 13 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> index e85c4689ce2c..eafe01a24349 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>>> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
>> amdgpu_device *adev,
>>>    {
>>>    	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>>>    	struct ras_err_handler_data *data;
>>> -	int ret = 0;
>>> +	int ret = 0, old_cnt;
>>>    	uint32_t i;
>>>
>>>    	if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 +2060,8
>>> @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>    	if (!data)
>>>    		goto out;
>>>
>>> +	old_cnt = data->count;
>>> +
>>>    	for (i = 0; i < pages; i++) {
>>>    		if (amdgpu_ras_check_bad_page_unlock(con,
>>>    			bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT))
>> @@ -2079,6
>>> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>>>    		data->count++;
>>>    		data->space_left--;
>>>    	}
>>> +
>>> +	/* all pages have been reserved before, no new bad page */
>>> +	if (old_cnt == data->count)
>>> +		ret = -EEXIST;
>>> +
>>>    out:
>>>    	mutex_unlock(&con->recovery_lock);
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> index 1c7fcb4f2380..772c431e4065 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
>>> @@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct
>>> amdgpu_device *adev,
>>>
>>>    		if ((amdgpu_bad_page_threshold != 0) &&
>>>    			err_data->err_addr_cnt) {
>>> -			amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>>> +			ret = amdgpu_ras_add_bad_pages(adev, err_data-
>>> err_addr,
>>>    						err_data->err_addr_cnt);
>>> +			/* if no new bad page is found, no need to increase ue
>> count */
>>> +			if (ret == -EEXIST)
>>> +				err_data->ue_count = 0;
>>> +
>>>    			amdgpu_ras_save_bad_pages(adev);
>>>
>>>    			amdgpu_dpm_send_hbm_bad_pages_num(adev,
>>> con->eeprom_control.ras_num_recs);