[PATCH 0/4] add SDMA ras error reporting support

Hawking Zhang Hawking.Zhang at amd.com
Wed Jan 8 16:17:17 UTC 2020


Currently, the sdma edc counters are grouped into the gfx edc
counter register array (sec_ded_counter_registers), which leads
to several issues, including:
1). sdma ras errors are counted against the gfx ip block when
querying the gfx error counter (i.e. through the sysfs
gfx_error_count node).
2). kernel crash (NULL pointer access) when querying the gfx
error counter on vega20. vega20 has only 2 sdma instances, but
the gfx edc counter register array is shared by the arcturus and
vega20 cases, so the driver is forced to read the sdma2 ~ 7 edc
counter registers even though their ip base addresses are not
initialized.
3). unnecessary/wrong grbm switch even when reading the sdma edc
counters.

To fix the above issues, this series separates the sdma ras query
functions from the gfx ones. It checks the sdma_edc_counters and
reports back the error count and the error type as well.
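
For illustration only, the idea behind issue 2) and its fix can be
sketched in plain user-space C. This is not the driver code: the
struct fake_dev type, the MAX_SDMA_INSTANCES constant and the
query_sdma_error_count() helper are hypothetical names made up for
this sketch, and the "registers" are ordinary memory. The point it
shows is that the query loop is bounded by the number of sdma
instances the device actually has, rather than by the size of a
table shared with a larger asic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: size of a table shared by all asics (8 instances). */
#define MAX_SDMA_INSTANCES 8

/*
 * Hypothetical stand-in for a device. Only the first num_instances
 * entries of edc_counter point at valid storage; the remaining
 * entries stay NULL, like an ip base address that was never set up.
 */
struct fake_dev {
	int num_instances;
	const uint32_t *edc_counter[MAX_SDMA_INSTANCES];
};

/*
 * Sum the edc counters of the instances that exist on this device.
 * Looping to MAX_SDMA_INSTANCES instead would dereference the NULL
 * entries on a 2-instance device.
 * Returns 0 on success, -1 if an expected instance is not mapped.
 */
static int query_sdma_error_count(const struct fake_dev *dev,
				  uint32_t *total)
{
	int i;

	*total = 0;
	for (i = 0; i < dev->num_instances; i++) {
		if (!dev->edc_counter[i])
			return -1;
		*total += *dev->edc_counter[i];
	}
	return 0;
}
```

With a 2-instance device whose counters hold 3 and 4, the helper
returns 0 and sets the total to 7 without touching entries 2 ~ 7.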

Hawking Zhang (4):
  drm/amdgpu: add query_ras_error_count function for sdma v4
  drm/amdgpu: support error reporting for sdma ip block
  drm/amdgpu: add ras_late_init and ras_fini for sdma v4
  drm/amdgpu: read sdma edc counter to clear the counters

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c  |   7 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |   9 ++
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c    |  11 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c   | 176 ++++++++++++++++++++++-
 4 files changed, 191 insertions(+), 12 deletions(-)

-- 
2.17.1


