[PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

Lazar, Lijo lijo.lazar at amd.com
Fri Oct 22 11:26:33 UTC 2021



On 10/21/2021 7:26 PM, Russell, Kent wrote:
> [AMD Official Use Only]
> 
> 
> 
>> -----Original Message-----
>> From: Lazar, Lijo <Lijo.Lazar at amd.com>
>> Sent: Thursday, October 21, 2021 1:25 AM
>> To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
>> Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Joshi, Mukul <Mukul.Joshi at amd.com>
>> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
>>
>>
>>
>> On 10/20/2021 10:05 PM, Kent Russell wrote:
>>> If the bad_page_threshold kernel parameter is set to -2,
>>> continue to post the GPU. Print a warning to dmesg that this action has
>>> been done, and that page retirement will obviously not work for said GPU
>>>
>>> Cc: Luben Tuikov <luben.tuikov at amd.com>
>>> Cc: Mukul Joshi <Mukul.Joshi at amd.com>
>>> Signed-off-by: Kent Russell <kent.russell at amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
>>>    1 file changed, 9 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> index 1ede0f0d6f55..31852330c1db 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
>> amdgpu_ras_eeprom_control *control,
>>>    			res = amdgpu_ras_eeprom_correct_header_tag(control,
>>>    								   RAS_TABLE_HDR_VAL);
>>>    		} else {
>>> -			*exceed_err_limit = true;
>>> -			dev_err(adev->dev,
>>> -				"RAS records:%d exceed threshold:%d, "
>>> -				"GPU will not be initialized. Replace this GPU or increase the
>> threshold",
>>> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>>>    				control->ras_num_recs, ras->bad_page_cnt_threshold);
>>> +			if (amdgpu_bad_page_threshold == -2) {
>>> +				dev_warn(adev->dev, "GPU will be initialized due to
>> bad_page_threshold = -2.");
>>> +				dev_warn(adev->dev, "Page retirement will not work for
>> this GPU in this state.");
>>
>> Now, this looks as good as booting with 0 = disable bad page retirement.
>> I thought page retirement will work as long as EEPROM has space, but it
>> won't bother about the threshold. If the intent is to ignore bad page
>> retirement, then 0 is good enough and -2 is not required.
>>
>> Also, when user passes threshold=-2, what is the threshold being
>> compared against to say *exceed_err_limit = true?
> 
> My thought on having the -2 option is that we'll still enable page retirement, we just won't shut the GPU down when it hits the threshold. The bad pages will still be retired as we hit them, it will just never disable the GPU. The comment about retirement not working is definitely incorrect now (leftover from previous local patches), so I'll remove that. In this case, I don't think we'd ever exceed the error limit. exceed_err_limit is only really used when we are disabling the GPU, so we wouldn't want to set that to true. Otherwise we wouldn't be loading the bad pages in gpu_recovery_init, and we'll still return 0 from gpu_recovery_init.
> 

Probably, this is too late as this patch is superseded. Replying to this 
to keep the context.

I realized that the question went in the wrong direction because of 
*exceed_err_limit = true. Saw it in the earlier line but didn't realize 
that it was removed.

Anyway, main question I wanted to ask was -
	
	"When user passes threshold=-2, what is the threshold being compared 
against?"

Looked it up today and saw this piece of code.
	        if (amdgpu_bad_page_threshold < 0) {
                 u64 val = adev->gmc.mc_vram_size;

                 do_div(val, RAS_BAD_PAGE_COVER);
                 con->bad_page_cnt_threshold = min(lower_32_bits(val),
                                                   max_count);


For a user who has been specifying say bad_page_threshold=100 in the 
command line and unable to boot, now this is doing something else.

The comparison in next boot is done against a different threshold. In 
fact, it won't even come to this comparison logic in the next boot as 
the threshold is raised by default when this parameter is specified.

The only case this parameter takes effect is when a user is unable to 
boot because bad page retirements have exceeded the "default threshold". 
In other cases it just resets the comparison limit. Hope that is the 
intention.

Thanks,
Lijo


>   Kent
>>
>> Thanks,
>> Lijo
>>
>>> +				res = 0;
>>> +			} else {
>>> +				*exceed_err_limit = true;
>>> +				dev_err(adev->dev, "GPU will not be initialized. Replace this
>> GPU or increase the threshold.");
>>> +			}
>>>    		}
>>>    	} else {
>>>    		DRM_INFO("Creating a new EEPROM table");
>>>


More information about the amd-gfx mailing list