[PATCH] drm/amdgpu: Modify the count method of defer error

Zhou1, Tao Tao.Zhou1 at amd.com
Wed May 7 03:53:16 UTC 2025


[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Sun, Ce(Overlord) <Ce.Sun at amd.com>
> Sent: Wednesday, May 7, 2025 10:25 AM
> To: amd-gfx at lists.freedesktop.org
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Sun, Ce(Overlord) <Ce.Sun at amd.com>
> Subject: [PATCH] drm/amdgpu: Modify the count method of defer error
>
> The number of newly added de counts and the number of newly added error
> addresses remain consistent
>
> Signed-off-by: Ce Sun <cesun102 at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h |  1 +
> drivers/gpu/drm/amd/amdgpu/umc_v12_0.c  | 11 +++++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> index 857693bcd8d4..52fb71c4ce9d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_umc.h
> @@ -130,6 +130,7 @@ struct amdgpu_umc {
>
>       /* active mask for umc node instance */
>       unsigned long active_mask;
> +     unsigned long err_addr_cnt;
>  };
>
>  int amdgpu_umc_ras_sw_init(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> index 0e404c074975..eb3f99dbbcd7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/umc_v12_0.c
> @@ -262,6 +262,9 @@ static int umc_v12_0_convert_error_address(struct amdgpu_device *adev,
>                               soc_pa, channel_index, umc_inst);
>       }
>
> +     if (err_data)
> +             adev->umc.err_addr_cnt = err_data->err_addr_cnt;

[Tao] Better not to update it here.
Sometimes the error address can't be queried even though error status and poison consumption are generated. adev->umc.err_addr_cnt should be 0 in that case, but if it is only assigned here there is no chance to reset it to 0.
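
A minimal sketch of the concern, using hypothetical stand-in types (fake_umc, fake_err_data) and made-up values rather than the real amdgpu structures: when the counter is only written on the address-conversion path, an event that skips conversion still sees the previous value.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration only; not the real amdgpu types. */
struct fake_err_data {
	unsigned long err_addr_cnt;
};

struct fake_umc {
	unsigned long err_addr_cnt;
	unsigned long retire_unit;
};

/* Mirrors the update added by the patch: runs only when conversion happens. */
static void record_converted_addresses(struct fake_umc *umc,
				       const struct fake_err_data *err_data)
{
	if (err_data)
		umc->err_addr_cnt = err_data->err_addr_cnt;
}

/* Mirrors the patched deferred-error count for ext_error_code == 0. */
static unsigned long deferred_count(const struct fake_umc *umc)
{
	return umc->err_addr_cnt / umc->retire_unit;
}

/* Walks through two events; the second one never reaches the conversion path. */
static void demo_stale_count(void)
{
	struct fake_umc umc = { .err_addr_cnt = 0, .retire_unit = 8 };
	struct fake_err_data first = { .err_addr_cnt = 16 };

	/* Event 1: addresses were queried and converted (assumed 16 addresses). */
	record_converted_addresses(&umc, &first);
	assert(deferred_count(&umc) == 2);

	/* Event 2: status is logged but no address is available, so the
	 * conversion path is skipped and nothing resets the counter. */
	assert(deferred_count(&umc) == 2); /* stale; 0 would be expected */
}
```

In this sketch the second event reports the count left over from the first one, which is the stale state described above.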

> +
>  out:
>       return ret;
>  }
> @@ -428,8 +431,12 @@ static int umc_v12_0_aca_bank_parser(struct aca_handle *handle, struct aca_bank
>               bank->regs[ACA_REG_IDX_ADDR]);
>
>       ext_error_code = ACA_REG__STATUS__ERRORCODEEXT(status);
> -     count = ext_error_code == 0 ?
> -             ACA_REG__MISC0__ERRCNT(bank->regs[ACA_REG_IDX_MISC0]) : 1ULL;
> +     if (umc_v12_0_is_deferred_error(adev, status))
> +             count = ext_error_code == 0 ?
> +                     adev->umc.err_addr_cnt / adev->umc.retire_unit : 1ULL;
> +     else
> +             count = ext_error_code == 0 ?
> +                     ACA_REG__MISC0__ERRCNT(bank->regs[ACA_REG_IDX_MISC0]) : 1ULL;
>
>       return aca_error_cache_log_bank_error(handle, &info, err_type, count);
>  }
> --
> 2.34.1


