[PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

Russell, Kent Kent.Russell at amd.com
Thu Oct 21 13:56:34 UTC 2021


[AMD Official Use Only]



> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar at amd.com>
> Sent: Thursday, October 21, 2021 1:25 AM
> To: Russell, Kent <Kent.Russell at amd.com>; amd-gfx at lists.freedesktop.org
> Cc: Tuikov, Luben <Luben.Tuikov at amd.com>; Joshi, Mukul <Mukul.Joshi at amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
> 
> 
> 
> On 10/20/2021 10:05 PM, Kent Russell wrote:
> > If the bad_page_threshold kernel parameter is set to -2,
> > continue to post the GPU. Print a warning to dmesg that this action has
> > been done, and that page retirement will obviously not work for said GPU
> >
> > Cc: Luben Tuikov <luben.tuikov at amd.com>
> > Cc: Mukul Joshi <Mukul.Joshi at amd.com>
> > Signed-off-by: Kent Russell <kent.russell at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
> >   1 file changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 1ede0f0d6f55..31852330c1db 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >   			res = amdgpu_ras_eeprom_correct_header_tag(control,
> >   								   RAS_TABLE_HDR_VAL);
> >   		} else {
> > -			*exceed_err_limit = true;
> > -			dev_err(adev->dev,
> > -				"RAS records:%d exceed threshold:%d, "
> > -				"GPU will not be initialized. Replace this GPU or increase the
> threshold",
> > +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >   				control->ras_num_recs, ras->bad_page_cnt_threshold);
> > +			if (amdgpu_bad_page_threshold == -2) {
> > +				dev_warn(adev->dev, "GPU will be initialized due to
> bad_page_threshold = -2.");
> > +				dev_warn(adev->dev, "Page retirement will not work for
> this GPU in this state.");
> 
> Now, this looks as good as booting with 0 = disable bad page retirement.
> I thought page retirement will work as long as EEPROM has space, but it
> won't bother about the threshold. If the intent is to ignore bad page
> retirement, then 0 is good enough and -2 is not required.
> 
> Also, when user passes threshold=-2, what is the threshold being
> compared against to say *exceed_err_limit = true?

My thought on the -2 option is that page retirement stays enabled; we just won't shut the GPU down when it hits the threshold. Bad pages will still be retired as we hit them, but the GPU will never be disabled. The warning about retirement not working is definitely incorrect now (a leftover from previous local patches), so I'll remove it. With -2, I don't think we should ever treat the error limit as exceeded: exceed_err_limit is only really used when we are disabling the GPU, so we don't want to set it to true. Otherwise we wouldn't load the bad pages in gpu_recovery_init; this way we do, and we still return 0 from gpu_recovery_init.
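
The behaviour described above can be sketched as a small standalone C function. This is only an illustration under stated assumptions, not the driver code: exceeds_err_limit() and BAD_PAGE_THRESHOLD_IGNORE are hypothetical names, the ">=" comparison is assumed, and fprintf() stands in for dev_err()/dev_warn(). The return value plays the role of exceed_err_limit, so it is true only when the GPU would be disabled.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical name for the special parameter value discussed here. */
#define BAD_PAGE_THRESHOLD_IGNORE (-2)

/*
 * Returns true only when the GPU should be disabled, i.e. the role that
 * exceed_err_limit plays in the patch. The ">=" comparison between the
 * record count and the threshold is an assumption of this sketch.
 */
static bool exceeds_err_limit(int threshold_param, unsigned int num_recs,
			      unsigned int cnt_threshold)
{
	if (num_recs < cnt_threshold)
		return false;

	fprintf(stderr, "RAS records:%u exceed threshold:%u\n",
		num_recs, cnt_threshold);

	if (threshold_param == BAD_PAGE_THRESHOLD_IGNORE) {
		/* Keep going: pages are still retired, the GPU is not disabled. */
		fprintf(stderr,
			"GPU will be initialized due to bad_page_threshold = -2.\n");
		return false;
	}

	fprintf(stderr,
		"GPU will not be initialized. Replace this GPU or increase the threshold.\n");
	return true;
}
```

With 10 records against a threshold of 5, this returns false when the parameter is -2 and true for an ordinary positive parameter value; below the threshold it returns false either way.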

 Kent
> 
> Thanks,
> Lijo
> 
> > +				res = 0;
> > +			} else {
> > +				*exceed_err_limit = true;
> > +				dev_err(adev->dev, "GPU will not be initialized. Replace this
> GPU or increase the threshold.");
> > +			}
> >   		}
> >   	} else {
> >   		DRM_INFO("Creating a new EEPROM table");
> >

