[PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case

Thu Oct 21 13:57:32 UTC 2021

[AMD Official Use Only]

> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov at amd.com>
> Sent: Wednesday, October 20, 2021 6:01 PM
> To: Kuehling, Felix <Felix.Kuehling at amd.com>; Russell, Kent <Kent.Russell at amd.com>;
> amd-gfx at lists.freedesktop.org
> Cc: Joshi, Mukul <Mukul.Joshi at amd.com>
> Subject: Re: [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case
> 
> On 2021-10-20 17:54, Felix Kuehling wrote:
> > On 2021-10-20 12:35 p.m., Kent Russell wrote:
> >> If the bad_page_threshold kernel parameter is set to -2,
> >> continue to post the GPU. Print a warning to dmesg that this action has
> >> been done, and that page retirement will obviously not work for said GPU.
> > I'd squash patch 2 and 3. The squashed patch is
> >
> > Acked-by: Felix Kuehling <Felix.Kuehling at amd.com>
> 
> I was just thinking the same thing. Keep the title and text of patch 2 and add the description
> of 3 to 2. With that done:
> 
> Reviewed-by: Luben Tuikov <luben.tuikov at amd.com>

Sounds good, thanks. I was on the fence about combining them back when I had the separate kernel param; squashing at review time is easier than splitting a patch apart. I still need to work on patch #1, but thanks for the reviews here!

 Kent

> 
> Regards,
> Luben
> 
> >
> >
> >> Cc: Luben Tuikov <luben.tuikov at amd.com>
> >> Cc: Mukul Joshi <Mukul.Joshi at amd.com>
> >> Signed-off-by: Kent Russell <kent.russell at amd.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 13 +++++++++----
> >>   1 file changed, 9 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> index 1ede0f0d6f55..31852330c1db 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> @@ -1115,11 +1115,16 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
> >>   			res = amdgpu_ras_eeprom_correct_header_tag(control,
> >>   								   RAS_TABLE_HDR_VAL);
> >>   		} else {
> >> -			*exceed_err_limit = true;
> >> -			dev_err(adev->dev,
> >> -				"RAS records:%d exceed threshold:%d, "
> >> -				"GPU will not be initialized. Replace this GPU or increase the threshold",
> >> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >>   				control->ras_num_recs, ras->bad_page_cnt_threshold);
> >> +			if (amdgpu_bad_page_threshold == -2) {
> >> +				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
> >> +				dev_warn(adev->dev, "Page retirement will not work for this GPU in this state.");
> >> +				res = 0;
> >> +			} else {
> >> +				*exceed_err_limit = true;
> >> +				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
> >> +			}
> >>   		}
> >>   	} else {
> >>   		DRM_INFO("Creating a new EEPROM table");
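
For readers following the hunk above, here is a minimal user-space sketch of the branch the patch adds. It is only an illustration, not the driver code: the function name init_may_proceed(), its parameters, and the use of fprintf() in place of dev_err()/dev_warn() are all assumptions made for this example. It models only the case where the record count has already been found to exceed the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the "records exceed threshold" branch in the
 * quoted hunk. bad_page_threshold mirrors the module parameter value;
 * num_recs and threshold are only used for the log message.
 * Returns true if GPU initialization may continue.
 */
static bool init_may_proceed(int bad_page_threshold, int num_recs,
			     int threshold, bool *exceed_err_limit)
{
	assert(exceed_err_limit != NULL);

	fprintf(stderr, "RAS records:%d exceed threshold:%d\n",
		num_recs, threshold);

	if (bad_page_threshold == -2) {
		/* Override requested: warn, but leave the error flag clear. */
		fprintf(stderr, "warning: GPU will be initialized due to bad_page_threshold = -2.\n");
		fprintf(stderr, "warning: Page retirement will not work for this GPU in this state.\n");
		return true;
	}

	/* Default path: flag the condition so the caller can stop init. */
	*exceed_err_limit = true;
	fprintf(stderr, "GPU will not be initialized. Replace this GPU or increase the threshold.\n");
	return false;
}
```

With a parameter value of -2 the flag stays false and the function returns true. With any other value the flag is set and the function returns false, which corresponds to the old unconditional behaviour in the removed lines.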