[PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err
Yang, Stanley
Stanley.Yang at amd.com
Wed Feb 22 03:16:13 UTC 2023
[AMD Official Use Only - General]
The series is Reviewed-by: Stanley.Yang <Stanley.Yang at amd.com>
Regards,
Stanley
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Sent: Wednesday, February 22, 2023 10:52 AM
> To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
> Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>; Lazar,
> Lijo <Lijo.Lazar at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
>
> bad_page_threshold controls page retirement behavior, so it should also be
> checked.
>
> v2: simplify the condition of bad page handling path.
>
> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> ---
> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9d370465b08d..2e08fce87521 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> {
> struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>
> - if (!__is_ras_eeprom_supported(adev))
> + if (!__is_ras_eeprom_supported(adev) ||
> + !amdgpu_bad_page_threshold)
> return false;
>
> /* skip check eeprom table for VEGA20 Gaming */
> @@ -428,10 +429,18 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
> return false;
>
> if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> - dev_warn(adev->dev, "This GPU is in BAD status.");
> - dev_warn(adev->dev, "Please retire it or set a larger "
> - "threshold value when reloading driver.\n");
> - return true;
> + if (amdgpu_bad_page_threshold == -1) {
> + dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
> + con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
> + dev_warn(adev->dev,
> + "But GPU can be operated due to bad_page_threshold = -1.\n");
> + return false;
> + } else {
> + dev_warn(adev->dev, "This GPU is in BAD status.");
> + dev_warn(adev->dev, "Please retire it or set a larger "
> + "threshold value when reloading driver.\n");
> + return true;
> + }
> }
>
> return false;
> --
> 2.35.1
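
For readers following the thread, the decision logic after this patch can be
sketched as a small standalone C model. This is only an illustration under
stated assumptions: struct ras_model and its fields are hypothetical stand-ins
for driver state (they are not real amdgpu structures), the warnings are left
out, and the other early-return checks that appear only as context in the
second hunk are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver state the patched function reads. */
struct ras_model {
	bool eeprom_supported;   /* models __is_ras_eeprom_supported(adev) */
	bool table_hdr_bad;      /* models tbl_hdr.header == RAS_TABLE_HDR_BAD */
	int  bad_page_threshold; /* models the amdgpu_bad_page_threshold value */
};

/*
 * Mirrors the control flow of the patched check:
 * returns true only when the table is marked bad and the threshold
 * is neither 0 nor -1.
 */
bool check_err_threshold(const struct ras_model *m)
{
	assert(m != NULL);

	/* New in this patch: a threshold of 0 skips the check entirely. */
	if (!m->eeprom_supported || !m->bad_page_threshold)
		return false;

	if (m->table_hdr_bad) {
		/* New in this patch: -1 only warns, the GPU stays usable. */
		if (m->bad_page_threshold == -1)
			return false;

		/* Any other value: the GPU is reported as being in BAD status. */
		return true;
	}

	return false;
}
```

In this model, a threshold of 0 or -1 never reports the GPU as bad, while any
other value reports it as bad once the table header is marked bad.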