[PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad page

Mon Feb 13 08:25:00 UTC 2023

[AMD Official Use Only - General]

> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> Sent: Friday, February 10, 2023 11:02 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Yang,
> Stanley <Stanley.Yang at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>; Li,
> Candice <Candice.Li at amd.com>
> Subject: RE: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new
> bad page
>
> [AMD Official Use Only - General]
>
> +                       /* if no new bad page is found, no need to increase ue count */
> +                       if (ret == -EEXIST)
> +                               err_data->ue_count = 0;
>
> Returning EEXIST in such case is not reasonable. Might consider return a bool for
> amdgpu_ras_add_bad_pages: true means it does add some new bad page; false
> means it doesn't change anything.
>
> Regards,
> Hawking

[Tao] but it can returns -ENOMEM, amdgpu_ras_load_bad_pages and amdgpu_ras_recovery_init also need to check the return value. I'd like to keep the type of return value unchanged.
How about -EINVAL?

>
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Sent: Friday, February 10, 2023 16:45
> To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
> Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: [PATCH] drm/amdgpu: don't increase UMC RAS UE count if no new bad
> page
>
> If a UMC bad page is reserved but not freed by an application, the application
> may trigger uncorrectable error repeatly by accessing the page.
>
> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 9 ++++++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 6 +++++-
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index e85c4689ce2c..eafe01a24349 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2049,7 +2049,7 @@ int amdgpu_ras_add_bad_pages(struct
> amdgpu_device *adev,  {
>         struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>         struct ras_err_handler_data *data;
> -       int ret = 0;
> +       int ret = 0, old_cnt;
>         uint32_t i;
>
>         if (!con || !con->eh_data || !bps || pages <= 0) @@ -2060,6 +2060,8 @@
> int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>         if (!data)
>                 goto out;
>
> +       old_cnt = data->count;
> +
>         for (i = 0; i < pages; i++) {
>                 if (amdgpu_ras_check_bad_page_unlock(con,
>                         bps[i].retired_page << AMDGPU_GPU_PAGE_SHIFT)) @@ -2079,6
> +2081,11 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>                 data->count++;
>                 data->space_left--;
>         }
> +
> +       /* all pages have been reserved before, no new bad page */
> +       if (old_cnt == data->count)
> +               ret = -EEXIST;
> +
>  out:
>         mutex_unlock(&con->recovery_lock);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1c7fcb4f2380..772c431e4065 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -145,8 +145,12 @@ static int amdgpu_umc_do_page_retirement(struct
> amdgpu_device *adev,
>
>                 if ((amdgpu_bad_page_threshold != 0) &&
>                         err_data->err_addr_cnt) {
> -                       amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> +                       ret = amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
>                                                 err_data->err_addr_cnt);
> +                       /* if no new bad page is found, no need to increase ue count */
> +                       if (ret == -EEXIST)
> +                               err_data->ue_count = 0;
> +
>                         amdgpu_ras_save_bad_pages(adev);
>
>                         amdgpu_dpm_send_hbm_bad_pages_num(adev, con-
> >eeprom_control.ras_num_recs);
> --
> 2.35.1
>