[PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
Zhou1, Tao
Tao.Zhou1 at amd.com
Tue Feb 21 10:06:56 UTC 2023
[AMD Official Use Only - General]
> -----Original Message-----
> From: Yang, Stanley <Stanley.Yang at amd.com>
> Sent: Tuesday, February 21, 2023 5:34 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Zhang,
> Hawking <Hawking.Zhang at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>;
> Li, Candice <Candice.Li at amd.com>
> Subject: RE: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
>
> [AMD Official Use Only - General]
>
>
>
> > -----Original Message-----
> > From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> > Sent: Tuesday, February 21, 2023 4:29 PM
> > To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> > <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
> > Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>
> > Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> > Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> > ras_eeprom_check_err
> >
> > bad_page_threshold controls page retirement behavior and it should be
> > also checked.
> >
> > Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> > ---
> > .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 20 ++++++++++++++-----
> > 1 file changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index 9d370465b08d..c88123896fe8 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> > {
> > struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> >
> > - if (!__is_ras_eeprom_supported(adev))
> > + if (!__is_ras_eeprom_supported(adev) ||
> > + !amdgpu_bad_page_threshold)
> > return false;
> >
> > /* skip check eeprom table for VEGA20 Gaming */
> > @@ -428,10 +429,19 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> > return false;
> >
> > if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> > - dev_warn(adev->dev, "This GPU is in BAD status.");
> > - dev_warn(adev->dev, "Please retire it or set a larger "
> > - "threshold value when reloading driver.\n");
> > - return true;
> > + if (amdgpu_bad_page_threshold == -1) {
> > + dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
> > + con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
> > + dev_warn(adev->dev,
> > + "But GPU can be operated due to bad_page_threshold = -1.\n");
> > + return false;
> > + } else if (amdgpu_bad_page_threshold > 0 ||
> > + amdgpu_bad_page_threshold == -2) {
>
> Stanley: we can't guarantee that the user sets amdgpu_bad_page_threshold to an
> expected value (it could be -3, for example). How about writing this condition as below:
[Tao] Since "<= -2" and "> 0" can be treated the same way here, I will change the condition to a plain "else".
The value "-2" is not retired; it indicates that the threshold is calculated by the driver.
> else if (amdgpu_bad_page_threshold) {
> ...
> }
> And in patch#1 the value -2 isn't needed anymore.
>
> Regards,
> Stanley
> > + dev_warn(adev->dev, "This GPU is in BAD status.");
> > + dev_warn(adev->dev, "Please retire it or set a larger "
> > + "threshold value when reloading driver.\n");
> > + return true;
> > + }
> > }
> >
> > return false;
> > --
> > 2.35.1
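
For reference, the decision logic discussed in this thread, with the plain "else" in place, can be sketched as a standalone C function. This is only an illustration, not the driver code: check_err_threshold() and enum tbl_state are hypothetical names, the threshold argument stands in for the amdgpu_bad_page_threshold module parameter, and the __is_ras_eeprom_supported() and VEGA20 Gaming checks are left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for comparing tbl_hdr.header with RAS_TABLE_HDR_BAD. */
enum tbl_state { TBL_OK, TBL_BAD };

/*
 * Sketch of the threshold check with the final branch written as "else".
 * threshold plays the role of amdgpu_bad_page_threshold:
 *    0      -> check is skipped entirely
 *   -1      -> table may be BAD, but the GPU stays usable
 *   other   -> enforced (positive values, -2 for a driver-calculated
 *              threshold, and unexpected values such as -3)
 * Returns true when the GPU should be reported as being in BAD status.
 */
bool check_err_threshold(int threshold, enum tbl_state state)
{
	if (threshold == 0)
		return false;

	if (state != TBL_BAD)
		return false;

	if (threshold == -1)
		return false;	/* the patch only warns in this case */

	return true;		/* the "else" branch: every remaining value */
}
```

With this shape, an unexpected value such as -3 falls into the same branch as -2 and positive values, which is the case raised in the review comment.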