[PATCH 0/4] add SDMA ras error reporting support

Zhang, Hawking Hawking.Zhang at amd.com
Thu Jan 9 03:20:23 UTC 2020


[AMD Public Use]

To address your concerns

1). The SDMA_EDC_COUNTERS registers will be cleared by HW after they are read. They are read-only registers, so neither explicitly clearing them nor programming the EDC_COUNTER_CLEAR register is necessary.
2). Error injection and error reporting are actually separate features. That is to say, users may not be allowed to do error injection to generate errors, but once the hw edc feature is enabled, the driver should still be able to collect and report error information.
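
To illustrate point 1, here is a minimal sketch of read-to-clear behavior. It is not the driver code: the register is simulated in memory, and every name in it (mock_sdma, mock_read_edc_counter, mock_reset_edc_counters) is hypothetical. It only shows that, under this model, resetting the counters needs a read per instance and no write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware; NOT the amdgpu register API. */
#define MOCK_SDMA_MAX_INSTANCES 8

struct mock_sdma {
	int num_instances;                             /* instances present on the asic */
	uint32_t edc_counter[MOCK_SDMA_MAX_INSTANCES]; /* simulated counter values */
};

/*
 * Simulates a read-to-clear register: the read returns the current
 * value and the simulated "hardware" zeroes the counter afterwards.
 */
static uint32_t mock_read_edc_counter(struct mock_sdma *sdma, int inst)
{
	uint32_t val;

	assert(inst >= 0 && inst < sdma->num_instances);
	val = sdma->edc_counter[inst];
	sdma->edc_counter[inst] = 0;
	return val;
}

/* Resets all counters by reading them and discarding the values. */
static void mock_reset_edc_counters(struct mock_sdma *sdma)
{
	int i;

	for (i = 0; i < sdma->num_instances; i++)
		(void)mock_read_edc_counter(sdma, i);
}
```

In this model a second read of the same instance returns 0, which is why a separate clear step would be redundant.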

Regards,
Hawking

-----Original Message-----
From: Chen, Guchun <Guchun.Chen at amd.com> 
Sent: Thursday, January 9, 2020 09:04
To: Zhang, Hawking <Hawking.Zhang at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Clements, John <John.Clements at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Long, Gang <Gang.Long at amd.com>
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>
Subject: RE: [PATCH 0/4] add SDMA ras error reporting support

[AMD Public Use]

Two comments in patch 1.

One more question for the series: we add an SDMA block case in the ras query, but there is no such case in ras error injection.
Then how do we know what triggers the SDMA ECC counter? Still by GFX injection?

With the above concerns fixed/clarified, the series is:
Reviewed-by: Guchun Chen <guchun.chen at amd.com>

-----Original Message-----
From: Hawking Zhang <Hawking.Zhang at amd.com> 
Sent: Thursday, January 9, 2020 12:17 AM
To: amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Clements, John <John.Clements at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Li, Candice <Candice.Li at amd.com>; Long, Gang <Gang.Long at amd.com>
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>
Subject: [PATCH 0/4] add SDMA ras error reporting support

Currently, sdma edc counters are grouped in the gfx edc counter register array (sec_ded_counter_registers), which results in several issues, including:
1). sdma ras errors are counted into the gfx ip block when querying the gfx error counter (i.e. through the sysfs gfx_error_count node).
2). kernel crash (NULL pointer access) when querying the gfx error counter on vega20. There are only 2 sdma instances on vega20, while the gfx edc counter register array unifies the arcturus and vega20 cases,
so the driver is forced to read the sdma2 ~ 7 edc counter registers even though their ip base addresses are not initialized.
3). unnecessary/wrong grbm switch even when reading sdma edc counters.

To fix the above issues, the series separates the sdma ras query functions from the gfx ones, checks the sdma_edc_counters, and reports back the error count as well as the error type.
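
The idea of a separate, per-instance sdma query can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patches: the types, the function name demo_sdma_query_ras_error_count, and the bit layout of the counter value (low byte correctable, next byte uncorrectable) are all hypothetical. The point it shows is that the loop is bounded by the number of sdma instances the device actually has, so instances that do not exist are never read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SDMA_INSTANCES 8
/* Hypothetical layout: bits [7:0] correctable, bits [15:8] uncorrectable. */
#define DEMO_EDC_COUNT_MASK     0xffu
#define DEMO_EDC_UE_SHIFT       8

struct demo_sdma_instance {
	const uint32_t *edc_counter_reg; /* NULL when the instance is absent */
};

struct demo_device {
	int num_sdma_instances; /* e.g. 2 on one asic, 8 on another */
	struct demo_sdma_instance sdma[DEMO_MAX_SDMA_INSTANCES];
};

struct demo_err_data {
	unsigned long ce_count; /* correctable errors */
	unsigned long ue_count; /* uncorrectable errors */
};

/*
 * Accumulates error counts over the instances that exist on this device.
 * Returns 0 on success, -1 on bad arguments or an uninitialized instance.
 */
static int demo_sdma_query_ras_error_count(const struct demo_device *dev,
					   struct demo_err_data *err)
{
	int i;

	if (!dev || !err)
		return -1;
	assert(dev->num_sdma_instances <= DEMO_MAX_SDMA_INSTANCES);

	err->ce_count = 0;
	err->ue_count = 0;

	/* Only walk instances present on this asic, never the array maximum. */
	for (i = 0; i < dev->num_sdma_instances; i++) {
		const uint32_t *reg = dev->sdma[i].edc_counter_reg;
		uint32_t val;

		if (!reg)
			return -1;
		val = *reg;
		err->ce_count += val & DEMO_EDC_COUNT_MASK;
		err->ue_count += (val >> DEMO_EDC_UE_SHIFT) & DEMO_EDC_COUNT_MASK;
	}
	return 0;
}
```

With two instances configured, entries 2 to 7 of the array stay NULL and are never dereferenced, which is the failure mode described in issue 2).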

Hawking Zhang (4):
  drm/amdgpu: add query_ras_error_count function for sdma v4
  drm/amdgpu: support error reporting for sdma ip block
  drm/amdgpu: add ras_late_init and ras_fini for sdma v4
  drm/amdgpu: read sdma edc counter to clear the counters

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  |   7 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |   9 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c    |  11 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 176 ++++++++++++++++++++++-
 4 files changed, 191 insertions(+), 12 deletions(-)

--
2.17.1

