[PATCH 0/4] enable umc ras ce interrupt
Zhou1, Tao
Tao.Zhou1 at amd.com
Thu Aug 1 08:34:40 UTC 2019
> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen at amd.com>
> Sent: August 1, 2019 16:22
> To: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao
> <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Li, Dennis
> <Dennis.Li at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: RE: [PATCH 0/4] enable umc ras ce interrupt
>
> 1) In patch 1, it looks like our callback always returns the UE case, but I
> assume the CE case should also be covered. Maybe that's another topic.
> if (ret == AMDGPU_RAS_UE) {
> + /* these counts could be left as 0 if
> + * some blocks do not count error number
> + */
> obj->err_data.ue_count += err_data.ue_count;
> + obj->err_data.ce_count += err_data.ce_count;
>
[Tao] Yes, it's a new topic. A CE can also trigger an interrupt, and both CE and UE errors can even be found in one RAS query. I think AMDGPU_RAS_SUCCESS is more suitable here; I'll provide a new patch to fix it.
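To illustrate the point about one query reporting both error types, here is a minimal standalone sketch. The names ras_ret, RAS_SUCCESS, RAS_FAIL, err_data and accumulate_errors are hypothetical stand-ins for illustration only, not the amdgpu definitions; the idea is simply that both counters are accumulated on any successful query instead of only in the UE case.

```c
/* Hypothetical stand-ins for the driver's types; not the amdgpu definitions. */
enum ras_ret { RAS_SUCCESS = 0, RAS_FAIL };

struct err_data {
	unsigned long ue_count; /* uncorrectable errors */
	unsigned long ce_count; /* correctable errors */
};

/*
 * Fold the counts from one query into the running totals.
 * Both counters are added on any successful query, since a single
 * query may report CE and UE together. A block that does not count
 * a given error type simply leaves that field at 0.
 */
void accumulate_errors(struct err_data *total, enum ras_ret ret,
		       const struct err_data *query)
{
	if (ret != RAS_SUCCESS)
		return;

	total->ue_count += query->ue_count;
	total->ce_count += query->ce_count;
}
```

Whether a GPU reset is needed can then be decided separately from the accumulated ue_count, which is what patch 4 in this series is about.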
> 2) In patch 2, there is an unused variable "ras_error_status". Do we need to
> remove it?
>
> static void umc_v6_1_ras_init(struct amdgpu_device *adev) {
> + void *ras_error_status = NULL;
>
> + amdgpu_umc_for_each_channel(umc_v6_1_ras_init_per_channel);
> }
[Tao] It's on purpose. The amdgpu_umc_for_each_channel macro is a common definition for all umc channel functions; it passes ras_error_status through to the channel function.
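For readers wondering why the variable looks unused, here is a simplified sketch of the pattern. It is not the real amdgpu_umc_for_each_channel definition; NUM_CHANNELS, for_each_channel and the helper functions are hypothetical. The assumption is only that the macro expands in the caller's scope and refers to a local named ras_error_status, so every caller has to declare one even when the per-channel function ignores it.

```c
#include <stddef.h>

#define NUM_CHANNELS 4 /* hypothetical channel count */

/*
 * Expands in the caller's scope and expects a local variable named
 * ras_error_status to exist there; it is forwarded to every call.
 */
#define for_each_channel(func)					\
	do {							\
		int ch_;					\
		for (ch_ = 0; ch_ < NUM_CHANNELS; ch_++)	\
			(func)(ch_, ras_error_status);		\
	} while (0)

int init_calls;

/* Init needs no status object and ignores the pointer. */
void init_per_channel(int channel, void *ras_error_status)
{
	(void)channel;
	(void)ras_error_status;
	init_calls++;
}

/* A query-style function uses the pointer to report results. */
void count_per_channel(int channel, void *ras_error_status)
{
	unsigned long *count = ras_error_status;

	(void)channel;
	(*count)++;
}

void ras_init(void)
{
	void *ras_error_status = NULL; /* looks unused, but the macro reads it */

	for_each_channel(init_per_channel);
}

unsigned long query_all(void)
{
	unsigned long total = 0;
	void *ras_error_status = &total;

	for_each_channel(count_per_channel);
	return total;
}
```

The cost of this design is exactly the confusion raised above: the dependency on the local name is invisible at the call site, so a short comment next to the declaration may be worth adding.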
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> Sent: Thursday, August 1, 2019 3:52 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Li,
> Dennis <Dennis.Li at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
> Pan, Xinhui <Xinhui.Pan at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: RE: [PATCH 0/4] enable umc ras ce interrupt
>
> 1) Please fix the typo in the patch #2 description: ec --> ce
> 2) Patch #2
>
> + ecc_err_cnt_sel = REG_SET_FIELD(ecc_err_cnt_sel,
> UMCCH0_0_EccErrCntSel,
> + EccErrInt, 0x1);
> For the EccErrInt field, it should be programmed to (MAX - INIT), correct?
> But the hardcoded value does not seem to match the value calculated by those
> macros.
>
> Regards,
> Hawking
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Tao
> Zhou
> Sent: August 1, 2019 14:54
> To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Chen,
> Guchun <Guchun.Chen at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: [PATCH 0/4] enable umc ras ce interrupt
>
> These patches add support for the umc ce interrupt; the interrupt is controlled
> by an error count threshold.
>
> Tao Zhou (4):
> drm/amdgpu: support ce interrupt in ras module
> drm/amdgpu: implement umc ras init function
> drm/amdgpu: update the calc algorithm of umc ecc error count
> drm/amdgpu: only uncorrectable error needs gpu reset
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 12 ++++---
> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 6 +++-
> drivers/gpu/drm/amd/amdgpu/umc_v6_1.c | 42 ++++++++++++++++++++++---
> drivers/gpu/drm/amd/amdgpu/umc_v6_1.h | 7 +++++
> 4 files changed, 58 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
>