[PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS UE count
Yang, Stanley
Stanley.Yang at amd.com
Fri Feb 17 06:49:18 UTC 2023
[AMD Official Use Only - General]
> -----Original Message-----
> From: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Sent: Friday, February 17, 2023 11:53 AM
> To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Chai,
> Thomas <YiPeng.Chai at amd.com>; Li, Candice <Candice.Li at amd.com>; Lazar,
> Lijo <Lijo.Lazar at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: [PATCH 2/2] drm/amdgpu: exclude duplicate pages from UMC RAS
> UE count
>
> If a UMC bad page is reserved but not freed by an application, the application
> may trigger uncorrectable errors repeatedly by accessing the page.
>
> v2: add specific function to do the check.
> v3: remove duplicate pages, calculate new added bad page number.
>
> Signed-off-by: Tao Zhou <tao.zhou1 at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 23 +++++++++++++++++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c | 2 ++
> 3 files changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 6e543558386d..777f85f3e5eb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -2115,6 +2115,29 @@ int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev)
> return 0;
> }
>
> +/* Remove duplicate pages, calculate new added bad page number.
> + * Note: the function should be called between amdgpu_ras_add_bad_pages
> + * and amdgpu_ras_save_bad_pages.
> + */
> +int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev)
> +{
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> + struct ras_err_handler_data *data;
> + struct amdgpu_ras_eeprom_control *control;
> + int save_count;
> +
> + if (!con || !con->eh_data)
> + return 0;
> +
> + mutex_lock(&con->recovery_lock);
> + control = &con->eeprom_control;
> + data = con->eh_data;
> + save_count = data->count - control->ras_num_recs;
> + mutex_unlock(&con->recovery_lock);
> +
> + return (save_count / adev->umc.retire_unit);
> +}
Stanley: It would be better to add a comment describing the return value.
Apart from that concern, the patch is Reviewed-by: Stanley.Yang <Stanley.Yang at amd.com>
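To illustrate the kind of return-value comment meant here, below is a minimal standalone sketch, not the driver code. The struct and function names are hypothetical stand-ins, and it assumes retire_unit is the number of page records produced per uncorrectable error, which is how the division in the patch reads. The guards against a non-positive retire_unit and a negative difference are additions for the sketch only.

```c
#include <stddef.h>

/* Hypothetical stand-in for the driver state; not the real amdgpu structs. */
struct demo_ras_state {
	int tracked_count;  /* plays the role of data->count */
	int saved_records;  /* plays the role of control->ras_num_recs */
	int retire_unit;    /* plays the role of adev->umc.retire_unit */
};

/*
 * demo_new_ue_count - count newly added uncorrectable errors
 *
 * Return: the number of page records tracked in memory but not yet saved,
 * divided by the number of records retired per error. Returns 0 when the
 * state is missing or when no new records were added.
 */
static int demo_new_ue_count(const struct demo_ras_state *s)
{
	int unsaved;

	/* Assumption for this sketch: treat a bad retire_unit as "no new errors". */
	if (!s || s->retire_unit <= 0)
		return 0;

	unsaved = s->tracked_count - s->saved_records;
	if (unsaved < 0)
		return 0;

	return unsaved / s->retire_unit;
}
```

With 24 tracked records, 8 already saved, and a retire unit of 8, the sketch reports 2 new errors; if every tracked record is already saved, it reports 0.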
> +
> /*
> * read error record array in eeprom and reserve enough space for
> * storing new bad pages
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index f2ad999993f6..e89c95438a88 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -549,6 +549,8 @@ int amdgpu_ras_add_bad_pages(struct amdgpu_device *adev,
>
> int amdgpu_ras_save_bad_pages(struct amdgpu_device *adev);
>
> +int amdgpu_ras_umc_new_ue_count(struct amdgpu_device *adev);
> +
> static inline enum ta_ras_block
> amdgpu_ras_block_to_ta(enum amdgpu_ras_block block) {
> switch (block) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> index 1c7fcb4f2380..45b6be7277dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.c
> @@ -147,6 +147,8 @@ static int amdgpu_umc_do_page_retirement(struct amdgpu_device *adev,
> err_data->err_addr_cnt) {
> amdgpu_ras_add_bad_pages(adev, err_data->err_addr,
> err_data->err_addr_cnt);
> + err_data->ue_count = amdgpu_ras_umc_new_ue_count(adev);
> +
> amdgpu_ras_save_bad_pages(adev);
>
> amdgpu_dpm_send_hbm_bad_pages_num(adev,
> con->eeprom_control.ras_num_recs);
> --
> 2.35.1
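The ordering note in the function comment (call between amdgpu_ras_add_bad_pages and amdgpu_ras_save_bad_pages) can be pictured with a simplified model. This is only a sketch under stated assumptions: all sketch_* names are hypothetical, the add step is assumed to count only pages that are not already tracked, and the save step is assumed to bring the saved record count up to the tracked count.

```c
/* Hypothetical model of the retirement state; not the real amdgpu structs. */
struct sketch_ctx {
	int tracked_count;  /* pages known in memory, like data->count */
	int saved_records;  /* records already persisted, like ras_num_recs */
	int retire_unit;    /* page records per error, assumed positive */
};

/* Assumed behavior: only pages not already tracked increase the count. */
static void sketch_add_bad_pages(struct sketch_ctx *c, int new_pages)
{
	c->tracked_count += new_pages;
}

/* Same arithmetic as the patch: unsaved records divided by retire_unit. */
static int sketch_new_ue_count(const struct sketch_ctx *c)
{
	return (c->tracked_count - c->saved_records) / c->retire_unit;
}

/* Assumed behavior: saving makes the persisted count catch up. */
static void sketch_save_bad_pages(struct sketch_ctx *c)
{
	c->saved_records = c->tracked_count;
}

/* Add, then count, then save: counting after save would always yield 0. */
static int sketch_retire(struct sketch_ctx *c, int new_pages)
{
	int ue;

	sketch_add_bad_pages(c, new_pages);
	ue = sketch_new_ue_count(c);
	sketch_save_bad_pages(c);
	return ue;
}
```

In this model, a repeated access to an already retired page adds no new pages, so the reported count for that event is 0, which matches the intent stated in the commit message.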