[PATCH 0/4] enable umc ras ce interrupt

Zhou1, Tao Tao.Zhou1 at amd.com
Thu Aug 1 08:34:40 UTC 2019



> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen at amd.com>
> Sent: August 1, 2019 16:22
> To: Zhang, Hawking <Hawking.Zhang at amd.com>; Zhou1, Tao
> <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Li, Dennis
> <Dennis.Li at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: RE: [PATCH 0/4] enable umc ras ce interrupt
> 
> 1) In patch 1, it looks like our callback always returns the UE case, but I
> assume the CE case should also be covered. Maybe that's a separate topic.
> 	if (ret == AMDGPU_RAS_UE) {
> +		/* these counts could be left as 0 if
> +		 * some blocks do not count error number
> +		 */
>  		obj->err_data.ue_count += err_data.ue_count;
> +		obj->err_data.ce_count += err_data.ce_count;
> 
[Tao] Yes, it's a new topic. A CE can also trigger an interrupt, and both CE and UE errors can even be found in a single RAS query. I think AMDGPU_RAS_SUCCESS is more suitable here; I'll provide a new patch to fix it.
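
As a rough standalone illustration of the point above (not the driver code), the sketch below adds both counts whenever a query succeeds, instead of only in a UE-specific branch. The names err_counts, query_result and accumulate_errors() are hypothetical stand-ins, and the real driver types and return codes differ.

```c
#include <stdint.h>

/* Hypothetical stand-ins; the real driver types and return codes differ. */
enum query_result { QUERY_FAIL = 0, QUERY_SUCCESS = 1 };

struct err_counts {
	uint64_t ue_count; /* uncorrectable errors */
	uint64_t ce_count; /* correctable errors */
};

/*
 * Fold one query's counts into the running totals. Either count may be
 * left as 0 if a block does not report that kind of error, so adding
 * both unconditionally on success is harmless.
 */
static void accumulate_errors(struct err_counts *total,
			      enum query_result ret,
			      const struct err_counts *query)
{
	if (ret != QUERY_SUCCESS)
		return;

	total->ue_count += query->ue_count;
	total->ce_count += query->ce_count;
}
```

With this shape, a single query that reports one UE and several CEs updates both totals in one call.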

> 2) In patch 2, there is an unused variable "ras_error_status". Do we need to
> remove it?
> 
> static void umc_v6_1_ras_init(struct amdgpu_device *adev)
> {
> +	void *ras_error_status = NULL;
> 
> +	amdgpu_umc_for_each_channel(umc_v6_1_ras_init_per_channel);
>  }
[Tao] It's on purpose. The amdgpu_umc_for_each_channel macro is a common definition shared by all UMC channel functions, and it passes ras_error_status to the channel function.
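
A simplified sketch of why such a variable only looks unused: if a macro body refers to a name directly, every caller must have a variable of that name in scope. The macro and functions below are hypothetical and much smaller than the real amdgpu_umc_for_each_channel; NUM_CHANNELS is an assumed value chosen for the example.

```c
#include <stddef.h>

#define NUM_CHANNELS 4 /* assumed channel count, for illustration only */

/*
 * The macro body names ras_error_status directly, so the caller must
 * declare a variable with exactly that name, even if it is just NULL.
 */
#define for_each_channel(func)					\
	do {							\
		unsigned int ch;				\
		for (ch = 0; ch < NUM_CHANNELS; ch++)		\
			func(ras_error_status, ch);		\
	} while (0)

static unsigned int init_calls;

/* Per-channel init; it takes the error-data pointer but does not need it. */
static void init_per_channel(void *ras_error_status, unsigned int channel)
{
	(void)ras_error_status;
	(void)channel;
	init_calls++;
}

static void ras_init(void)
{
	/* Looks unused here, but the macro expansion references it. */
	void *ras_error_status = NULL;

	for_each_channel(init_per_channel);
}
```

Removing the local declaration from ras_init() in this sketch would make the macro expansion fail to compile, which is the same dependency described above.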

> 
> Regards,
> Guchun
> 
> -----Original Message-----
> From: Zhang, Hawking <Hawking.Zhang at amd.com>
> Sent: Thursday, August 1, 2019 3:52 PM
> To: Zhou1, Tao <Tao.Zhou1 at amd.com>; amd-gfx at lists.freedesktop.org; Li,
> Dennis <Dennis.Li at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>;
> Pan, Xinhui <Xinhui.Pan at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: RE: [PATCH 0/4] enable umc ras ce interrupt
> 
> 1) Please fix the typo in the patch #2 description: ec --> ce
> 2) Patch #2:
> 
> +	ecc_err_cnt_sel = REG_SET_FIELD(ecc_err_cnt_sel, UMCCH0_0_EccErrCntSel,
> +					EccErrInt, 0x1);
> The EccErrInt field should be programmed to (MAX - INIT), correct?
> But the hardcoded value does not seem to match the value calculated by those
> macros.
> 
> Regards,
> Hawking
> -----Original Message-----
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Tao
> Zhou
> Sent: August 1, 2019 14:54
> To: amd-gfx at lists.freedesktop.org; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Chen,
> Guchun <Guchun.Chen at amd.com>; Pan, Xinhui <Xinhui.Pan at amd.com>
> Cc: Zhou1, Tao <Tao.Zhou1 at amd.com>
> Subject: [PATCH 0/4] enable umc ras ce interrupt
> 
> These patches add support for the umc ce interrupt. The interrupt is
> controlled by an error count threshold.
> 
> Tao Zhou (4):
>   drm/amdgpu: support ce interrupt in ras module
>   drm/amdgpu: implement umc ras init function
>   drm/amdgpu: update the calc algorithm of umc ecc error count
>   drm/amdgpu: only uncorrectable error needs gpu reset
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 12 ++++---
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  6 +++-
>  drivers/gpu/drm/amd/amdgpu/umc_v6_1.c   | 42 ++++++++++++++++++++++---
>  drivers/gpu/drm/amd/amdgpu/umc_v6_1.h   |  7 +++++
>  4 files changed, 58 insertions(+), 9 deletions(-)
> 
> --
> 2.17.1
> 