[PATCH] drm/amdgpu: refine usage of amdgpu_bad_page_threshold

Zhou1, Tao Tao.Zhou1 at amd.com
Fri Jun 13 04:00:40 UTC 2025


[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Xie, Patrick <Gangliang.Xie at amd.com>
> Sent: Friday, June 13, 2025 11:07 AM
> To: amd-gfx at lists.freedesktop.org
> Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao
> <Tao.Zhou1 at amd.com>; Xie, Patrick <Gangliang.Xie at amd.com>
> Subject: [PATCH] drm/amdgpu: refine usage of amdgpu_bad_page_threshold
>
> When amdgpu_bad_page_threshold is -1 or -2, the driver issues a warning
> when the threshold is reached and continues runtime services.
>
> Signed-off-by: ganglxie <ganglxie at amd.com>
> ---
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c    | 21 +++++++++----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index 2ddedf476542..a9246c53bde9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -763,18 +763,17 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
>               dev_warn(adev->dev,
>                       "Saved bad pages %d reaches threshold value %d\n",
>                       control->ras_num_bad_pages, ras->bad_page_cnt_threshold);
> -             control->tbl_hdr.header = RAS_TABLE_HDR_BAD;
> -             if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1) {
> -                     control->tbl_rai.rma_status = GPU_RETIRED__ECC_REACH_THRESHOLD;
> -                     control->tbl_rai.health_percent = 0;
> -             }
> -
>               if ((amdgpu_bad_page_threshold != -1) &&
> -                 (amdgpu_bad_page_threshold != -2))
> +                 (amdgpu_bad_page_threshold != -2)) {
> +                     control->tbl_hdr.header = RAS_TABLE_HDR_BAD;
> +                     if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1) {
> +                             control->tbl_rai.rma_status = GPU_RETIRED__ECC_REACH_THRESHOLD;
> +                             control->tbl_rai.health_percent = 0;
> +                     }
>                       ras->is_rma = true;
> -
> -             /* ignore the -ENOTSUPP return value */
> -             amdgpu_dpm_send_rma_reason(adev);
> +                     /* ignore the -ENOTSUPP return value */
> +                     amdgpu_dpm_send_rma_reason(adev);
> +             }
>       }
>
>       if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1)
> @@ -1509,7 +1508,7 @@ int amdgpu_ras_eeprom_check(struct amdgpu_ras_eeprom_control *control)
>                               "RAS records:%d exceed threshold:%d\n",
>                               control->ras_num_bad_pages, ras->bad_page_cnt_threshold);
>                       if ((amdgpu_bad_page_threshold == -1) ||
> -                         (amdgpu_bad_page_threshold == -2)) {
> +                             (amdgpu_bad_page_threshold == -2)) {

[Tao] This replacement only changes indentation and is unnecessary. With this fixed, the patch is:

Reviewed-by: Tao Zhou <tao.zhou1 at amd.com>

>                               res = 0;
>                               dev_warn(adev->dev,
>                                        "Please consult AMD Service Action Guide (SAG) for appropriate service procedures\n");
> --
> 2.34.1
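
For readers following the thread, the control flow of the first hunk after
this patch can be sketched as a small standalone C program. This is only an
illustration under stated assumptions, not the driver code: the struct
ras_state_sketch and the helpers threshold_is_warn_only() and
on_threshold_reached() are hypothetical names, and the boolean fields merely
model the side effects seen in the diff (marking the table header bad, setting
ras->is_rma, and calling amdgpu_dpm_send_rma_reason()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver state touched by the first hunk. */
struct ras_state_sketch {
	bool warned;       /* models the dev_warn() about reaching the threshold */
	bool header_bad;   /* models tbl_hdr.header = RAS_TABLE_HDR_BAD */
	bool is_rma;       /* models ras->is_rma = true */
	bool rma_reported; /* models the amdgpu_dpm_send_rma_reason() call */
};

/* -1 and -2 are the two parameter values the patch treats as warn-only. */
static bool threshold_is_warn_only(int bad_page_threshold)
{
	return bad_page_threshold == -1 || bad_page_threshold == -2;
}

/*
 * Called once the saved bad page count has reached the threshold.
 * The warning is always issued; the header is marked bad and the RMA
 * state is set only when the parameter is neither -1 nor -2.
 */
static void on_threshold_reached(struct ras_state_sketch *s,
				 int bad_page_threshold)
{
	assert(s != NULL);

	s->warned = true;

	if (!threshold_is_warn_only(bad_page_threshold)) {
		s->header_bad = true;
		s->is_rma = true;
		s->rma_reported = true;
	}
}
```

Before the patch, the steps modelled by header_bad and rma_reported ran
regardless of the parameter value; the hunk moves them inside the conditional
so that -1 and -2 leave only the warning.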


