[PATCH 2/2] drm/amdgpu: add bad_page_threshold check in ras_eeprom_check_err

Yang, Stanley Stanley.Yang at amd.com
Wed Feb 22 03:16:13 UTC 2023


[AMD Official Use Only - General]

The series is Reviewed-by: Stanley.Yang <Stanley.Yang at amd.com>

Regards,
Stanley
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Sent: Wednesday, February 22, 2023 10:52 AM
> To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
> Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>; Lazar,
> Lijo <Lijo.Lazar at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: [PATCH 2/2] drm/amdgpu: add bad_page_threshold check in
> ras_eeprom_check_err
> 
> bad_page_threshold controls page retirement behavior, so it should also be
> checked.
> 
> v2: simplify the condition of bad page handling path.
> 
> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 9d370465b08d..2e08fce87521 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -417,7 +417,8 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
>  {
>  	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> 
> -	if (!__is_ras_eeprom_supported(adev))
> +	if (!__is_ras_eeprom_supported(adev) ||
> +	    !amdgpu_bad_page_threshold)
>  		return false;
> 
>  	/* skip check eeprom table for VEGA20 Gaming */
> @@ -428,10 +429,18 @@ bool amdgpu_ras_eeprom_check_err_threshold(struct amdgpu_device *adev)
>  			return false;
> 
>  	if (con->eeprom_control.tbl_hdr.header == RAS_TABLE_HDR_BAD) {
> -		dev_warn(adev->dev, "This GPU is in BAD status.");
> -		dev_warn(adev->dev, "Please retire it or set a larger "
> -			 "threshold value when reloading driver.\n");
> -		return true;
> +		if (amdgpu_bad_page_threshold == -1) {
> +			dev_warn(adev->dev, "RAS records:%d exceed threshold:%d",
> +				con->eeprom_control.ras_num_recs, con->bad_page_cnt_threshold);
> +			dev_warn(adev->dev,
> +				"But GPU can be operated due to bad_page_threshold = -1.\n");
> +			return false;
> +		} else {
> +			dev_warn(adev->dev, "This GPU is in BAD status.");
> +			dev_warn(adev->dev, "Please retire it or set a larger "
> +				 "threshold value when reloading driver.\n");
> +			return true;
> +		}
>  	}
> 
>  	return false;
> --
> 2.35.1
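
For readers following the thread, the decision logic after this patch can be sketched as a small standalone C function. This is only an illustration under stated assumptions: `struct ras_state` and `check_err_threshold()` are hypothetical stand-ins for the driver state and for amdgpu_ras_eeprom_check_err_threshold(), the dev_warn() messages are left out, and the VEGA20 Gaming skip shown in the diff context is omitted.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the state the real function consults. */
struct ras_state {
	bool eeprom_supported;  /* models __is_ras_eeprom_supported(adev) */
	bool table_hdr_bad;     /* models tbl_hdr.header == RAS_TABLE_HDR_BAD */
	int bad_page_threshold; /* models the amdgpu_bad_page_threshold value */
};

/* Returns true only when the GPU should be reported as being in BAD status. */
static bool check_err_threshold(const struct ras_state *s)
{
	/* New in this patch: a threshold of 0 short-circuits the check. */
	if (!s->eeprom_supported || !s->bad_page_threshold)
		return false;

	if (s->table_hdr_bad) {
		/* New in this patch: -1 means warn but keep the GPU usable. */
		if (s->bad_page_threshold == -1)
			return false;
		return true;
	}

	return false;
}
```

In this sketch, a BAD table header yields true only when the threshold is neither 0 nor -1, which matches the two conditions the patch adds.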
