[RFC i-g-t v3 0/1] A tool to demonstrate use of netlink sockets to read RAS error counters
Rodrigo Vivi
rodrigo.vivi at intel.com
Wed Jul 30 20:05:55 UTC 2025
On Wed, Jul 30, 2025 at 11:43:41AM +0530, Aravind Iddamsetty wrote:
> This tool is to demonstrate the use of netlink sockets to read RAS error
> counters, which is being proposed via series
Alex, what tools are in use for RAS on AMD side?
I noticed something in the mesa repo recently. But perhaps you have other
high-level tools as well?
I'm wondering if we should try to consolidate on this tool here in IGT
as the official one to drive the RAS netlink APIs in a unified way, besides
converting any other tools to this API, of course.
> "[RFC v5 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem".
>
> v2: update uapi header.
> v3: Add DRM_RAS_CMD_READ_BLOCK command to read errors from an IP Block.
>
> The tool supports the following commands:
> READ_ONE, READ_BLOCK, READ_ALL, WAIT_ON_EVENT, LIST_ERRORS
>
> read single error counter:
>
> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0000000000000005
> counter value 0
>
> read all error counters:
>
> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
> name config-id counter
>
> error-gt0-correctable-guc 0x0000000000000001 0
> error-gt0-correctable-slm 0x0000000000000003 0
> error-gt0-correctable-eu-ic 0x0000000000000004 0
> error-gt0-correctable-eu-grf 0x0000000000000005 0
> error-gt0-fatal-guc 0x0000000000000009 0
> error-gt0-fatal-slm 0x000000000000000d 0
> error-gt0-fatal-eu-grf 0x000000000000000f 0
> error-gt0-fatal-fpu 0x0000000000000010 0
> error-gt0-fatal-tlb 0x0000000000000011 0
> error-gt0-fatal-l3-fabric 0x0000000000000012 0
> error-gt0-correctable-subslice 0x0000000000000013 0
> error-gt0-correctable-l3bank 0x0000000000000014 0
> error-gt0-fatal-subslice 0x0000000000000015 0
> error-gt0-fatal-l3bank 0x0000000000000016 0
> error-gt0-sgunit-correctable 0x0000000000000017 0
> error-gt0-sgunit-nonfatal 0x0000000000000018 0
> error-gt0-sgunit-fatal 0x0000000000000019 0
> error-gt0-soc-fatal-psf-csc-0 0x000000000000001a 0
> error-gt0-soc-fatal-psf-csc-1 0x000000000000001b 0
> error-gt0-soc-fatal-psf-csc-2 0x000000000000001c 0
> error-gt0-soc-fatal-punit 0x000000000000001d 0
> error-gt0-soc-fatal-psf-0 0x000000000000001e 0
> error-gt0-soc-fatal-psf-1 0x000000000000001f 0
> error-gt0-soc-fatal-psf-2 0x0000000000000020 0
> error-gt0-soc-fatal-cd0 0x0000000000000021 0
> error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022 0
> error-gt0-soc-fatal-mdfi-east 0x0000000000000023 0
> error-gt0-soc-fatal-mdfi-south 0x0000000000000024 0
> error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025 0
> error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026 0
> error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027 0
> error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028 0
> error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029 0
> error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a 0
> error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b 0
> error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c 0
> error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d 0
> error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e 0
> error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f 0
> error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030 0
> error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031 0
> error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032 0
> error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033 0
> error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034 0
> error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035 0
> error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036 0
> error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037 0
> error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038 0
> error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039 0
> error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a 0
> error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b 0
> error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c 0
> error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d 0
> error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e 0
> error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f 0
> error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040 0
> error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041 0
> error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042 0
> error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043 0
> error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044 0
> error-gt0-gsc-correctable-sram-ecc 0x0000000000000045 0
> error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046 0
> error-gt0-gsc-nonfatal-mia-int 0x0000000000000047 0
> error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048 0
> error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049 0
> error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a 0
> error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b 0
> error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c 0
> error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d 0
> error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e 0
> error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f 0
> error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050 0
> error-gt1-correctable-guc 0x1000000000000001 0
> error-gt1-correctable-slm 0x1000000000000003 0
> error-gt1-correctable-eu-ic 0x1000000000000004 0
> error-gt1-correctable-eu-grf 0x1000000000000005 0
> error-gt1-fatal-guc 0x1000000000000009 0
> error-gt1-fatal-slm 0x100000000000000d 0
> error-gt1-fatal-eu-grf 0x100000000000000f 0
> error-gt1-fatal-fpu 0x1000000000000010 0
> error-gt1-fatal-tlb 0x1000000000000011 0
> error-gt1-fatal-l3-fabric 0x1000000000000012 0
> error-gt1-correctable-subslice 0x1000000000000013 0
> error-gt1-correctable-l3bank 0x1000000000000014 0
> error-gt1-fatal-subslice 0x1000000000000015 0
> error-gt1-fatal-l3bank 0x1000000000000016 0
> error-gt1-sgunit-correctable 0x1000000000000017 0
> error-gt1-sgunit-nonfatal 0x1000000000000018 0
> error-gt1-sgunit-fatal 0x1000000000000019 0
> error-gt1-soc-fatal-psf-csc-0 0x100000000000001a 0
> error-gt1-soc-fatal-psf-csc-1 0x100000000000001b 0
> error-gt1-soc-fatal-psf-csc-2 0x100000000000001c 0
> error-gt1-soc-fatal-punit 0x100000000000001d 0
> error-gt1-soc-fatal-psf-0 0x100000000000001e 0
> error-gt1-soc-fatal-psf-1 0x100000000000001f 0
> error-gt1-soc-fatal-psf-2 0x1000000000000020 0
> error-gt1-soc-fatal-cd0 0x1000000000000021 0
> error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022 0
> error-gt1-soc-fatal-mdfi-east 0x1000000000000023 0
> error-gt1-soc-fatal-mdfi-south 0x1000000000000024 0
> error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025 0
> error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026 0
> error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027 0
> error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028 0
> error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029 0
> error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a 0
> error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b 0
> error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c 0
> error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d 0
> error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e 0
> error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f 0
> error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030 0
> error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031 0
> error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032 0
> error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033 0
> error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034 0
> error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035 0
> error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036 0
> error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037 0
> error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038 0
> error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039 0
> error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a 0
> error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b 0
> error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c 0
> error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d 0
> error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e 0
> error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f 0
> error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040 0
> error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041 0
> error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042 0
> error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043 0
> error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044 0
>
> wait on an error event:
>
> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1
> waiting for error event
> error event received
> counter value 0
>
> list all errors:
>
> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1
> name config-id
>
> error-gt0-correctable-guc 0x0000000000000001
> error-gt0-correctable-slm 0x0000000000000003
> error-gt0-correctable-eu-ic 0x0000000000000004
> error-gt0-correctable-eu-grf 0x0000000000000005
> error-gt0-fatal-guc 0x0000000000000009
> error-gt0-fatal-slm 0x000000000000000d
> error-gt0-fatal-eu-grf 0x000000000000000f
> error-gt0-fatal-fpu 0x0000000000000010
> error-gt0-fatal-tlb 0x0000000000000011
> error-gt0-fatal-l3-fabric 0x0000000000000012
> error-gt0-correctable-subslice 0x0000000000000013
> error-gt0-correctable-l3bank 0x0000000000000014
> error-gt0-fatal-subslice 0x0000000000000015
> error-gt0-fatal-l3bank 0x0000000000000016
> error-gt0-sgunit-correctable 0x0000000000000017
> error-gt0-sgunit-nonfatal 0x0000000000000018
> error-gt0-sgunit-fatal 0x0000000000000019
> error-gt0-soc-fatal-psf-csc-0 0x000000000000001a
> error-gt0-soc-fatal-psf-csc-1 0x000000000000001b
> error-gt0-soc-fatal-psf-csc-2 0x000000000000001c
> error-gt0-soc-fatal-punit 0x000000000000001d
> error-gt0-soc-fatal-psf-0 0x000000000000001e
> error-gt0-soc-fatal-psf-1 0x000000000000001f
> error-gt0-soc-fatal-psf-2 0x0000000000000020
> error-gt0-soc-fatal-cd0 0x0000000000000021
> error-gt0-soc-fatal-cd0-mdfi 0x0000000000000022
> error-gt0-soc-fatal-mdfi-east 0x0000000000000023
> error-gt0-soc-fatal-mdfi-south 0x0000000000000024
> error-gt0-soc-fatal-hbm-ss0-0 0x0000000000000025
> error-gt0-soc-fatal-hbm-ss0-1 0x0000000000000026
> error-gt0-soc-fatal-hbm-ss0-2 0x0000000000000027
> error-gt0-soc-fatal-hbm-ss0-3 0x0000000000000028
> error-gt0-soc-fatal-hbm-ss0-4 0x0000000000000029
> error-gt0-soc-fatal-hbm-ss0-5 0x000000000000002a
> error-gt0-soc-fatal-hbm-ss0-6 0x000000000000002b
> error-gt0-soc-fatal-hbm-ss0-7 0x000000000000002c
> error-gt0-soc-fatal-hbm-ss1-0 0x000000000000002d
> error-gt0-soc-fatal-hbm-ss1-1 0x000000000000002e
> error-gt0-soc-fatal-hbm-ss1-2 0x000000000000002f
> error-gt0-soc-fatal-hbm-ss1-3 0x0000000000000030
> error-gt0-soc-fatal-hbm-ss1-4 0x0000000000000031
> error-gt0-soc-fatal-hbm-ss1-5 0x0000000000000032
> error-gt0-soc-fatal-hbm-ss1-6 0x0000000000000033
> error-gt0-soc-fatal-hbm-ss1-7 0x0000000000000034
> error-gt0-soc-fatal-hbm-ss2-0 0x0000000000000035
> error-gt0-soc-fatal-hbm-ss2-1 0x0000000000000036
> error-gt0-soc-fatal-hbm-ss2-2 0x0000000000000037
> error-gt0-soc-fatal-hbm-ss2-3 0x0000000000000038
> error-gt0-soc-fatal-hbm-ss2-4 0x0000000000000039
> error-gt0-soc-fatal-hbm-ss2-5 0x000000000000003a
> error-gt0-soc-fatal-hbm-ss2-6 0x000000000000003b
> error-gt0-soc-fatal-hbm-ss2-7 0x000000000000003c
> error-gt0-soc-fatal-hbm-ss3-0 0x000000000000003d
> error-gt0-soc-fatal-hbm-ss3-1 0x000000000000003e
> error-gt0-soc-fatal-hbm-ss3-2 0x000000000000003f
> error-gt0-soc-fatal-hbm-ss3-3 0x0000000000000040
> error-gt0-soc-fatal-hbm-ss3-4 0x0000000000000041
> error-gt0-soc-fatal-hbm-ss3-5 0x0000000000000042
> error-gt0-soc-fatal-hbm-ss3-6 0x0000000000000043
> error-gt0-soc-fatal-hbm-ss3-7 0x0000000000000044
> error-gt0-gsc-correctable-sram-ecc 0x0000000000000045
> error-gt0-gsc-nonfatal-mia-shutdown 0x0000000000000046
> error-gt0-gsc-nonfatal-mia-int 0x0000000000000047
> error-gt0-gsc-nonfatal-sram-ecc 0x0000000000000048
> error-gt0-gsc-nonfatal-wdg-timeout 0x0000000000000049
> error-gt0-gsc-nonfatal-rom-parity 0x000000000000004a
> error-gt0-gsc-nonfatal-ucode-parity 0x000000000000004b
> error-gt0-gsc-nonfatal-glitch-det 0x000000000000004c
> error-gt0-gsc-nonfatal-fuse-pull 0x000000000000004d
> error-gt0-gsc-nonfatal-fuse-crc-check 0x000000000000004e
> error-gt0-gsc-nonfatal-selfmbist 0x000000000000004f
> error-gt0-gsc-nonfatal-aon-parity 0x0000000000000050
> error-gt1-correctable-guc 0x1000000000000001
> error-gt1-correctable-slm 0x1000000000000003
> error-gt1-correctable-eu-ic 0x1000000000000004
> error-gt1-correctable-eu-grf 0x1000000000000005
> error-gt1-fatal-guc 0x1000000000000009
> error-gt1-fatal-slm 0x100000000000000d
> error-gt1-fatal-eu-grf 0x100000000000000f
> error-gt1-fatal-fpu 0x1000000000000010
> error-gt1-fatal-tlb 0x1000000000000011
> error-gt1-fatal-l3-fabric 0x1000000000000012
> error-gt1-correctable-subslice 0x1000000000000013
> error-gt1-correctable-l3bank 0x1000000000000014
> error-gt1-fatal-subslice 0x1000000000000015
> error-gt1-fatal-l3bank 0x1000000000000016
> error-gt1-sgunit-correctable 0x1000000000000017
> error-gt1-sgunit-nonfatal 0x1000000000000018
> error-gt1-sgunit-fatal 0x1000000000000019
> error-gt1-soc-fatal-psf-csc-0 0x100000000000001a
> error-gt1-soc-fatal-psf-csc-1 0x100000000000001b
> error-gt1-soc-fatal-psf-csc-2 0x100000000000001c
> error-gt1-soc-fatal-punit 0x100000000000001d
> error-gt1-soc-fatal-psf-0 0x100000000000001e
> error-gt1-soc-fatal-psf-1 0x100000000000001f
> error-gt1-soc-fatal-psf-2 0x1000000000000020
> error-gt1-soc-fatal-cd0 0x1000000000000021
> error-gt1-soc-fatal-cd0-mdfi 0x1000000000000022
> error-gt1-soc-fatal-mdfi-east 0x1000000000000023
> error-gt1-soc-fatal-mdfi-south 0x1000000000000024
> error-gt1-soc-fatal-hbm-ss0-0 0x1000000000000025
> error-gt1-soc-fatal-hbm-ss0-1 0x1000000000000026
> error-gt1-soc-fatal-hbm-ss0-2 0x1000000000000027
> error-gt1-soc-fatal-hbm-ss0-3 0x1000000000000028
> error-gt1-soc-fatal-hbm-ss0-4 0x1000000000000029
> error-gt1-soc-fatal-hbm-ss0-5 0x100000000000002a
> error-gt1-soc-fatal-hbm-ss0-6 0x100000000000002b
> error-gt1-soc-fatal-hbm-ss0-7 0x100000000000002c
> error-gt1-soc-fatal-hbm-ss1-0 0x100000000000002d
> error-gt1-soc-fatal-hbm-ss1-1 0x100000000000002e
> error-gt1-soc-fatal-hbm-ss1-2 0x100000000000002f
> error-gt1-soc-fatal-hbm-ss1-3 0x1000000000000030
> error-gt1-soc-fatal-hbm-ss1-4 0x1000000000000031
> error-gt1-soc-fatal-hbm-ss1-5 0x1000000000000032
> error-gt1-soc-fatal-hbm-ss1-6 0x1000000000000033
> error-gt1-soc-fatal-hbm-ss1-7 0x1000000000000034
> error-gt1-soc-fatal-hbm-ss2-0 0x1000000000000035
> error-gt1-soc-fatal-hbm-ss2-1 0x1000000000000036
> error-gt1-soc-fatal-hbm-ss2-2 0x1000000000000037
> error-gt1-soc-fatal-hbm-ss2-3 0x1000000000000038
> error-gt1-soc-fatal-hbm-ss2-4 0x1000000000000039
> error-gt1-soc-fatal-hbm-ss2-5 0x100000000000003a
> error-gt1-soc-fatal-hbm-ss2-6 0x100000000000003b
> error-gt1-soc-fatal-hbm-ss2-7 0x100000000000003c
> error-gt1-soc-fatal-hbm-ss3-0 0x100000000000003d
> error-gt1-soc-fatal-hbm-ss3-1 0x100000000000003e
> error-gt1-soc-fatal-hbm-ss3-2 0x100000000000003f
> error-gt1-soc-fatal-hbm-ss3-3 0x1000000000000040
> error-gt1-soc-fatal-hbm-ss3-4 0x1000000000000041
> error-gt1-soc-fatal-hbm-ss3-5 0x1000000000000042
> error-gt1-soc-fatal-hbm-ss3-6 0x1000000000000043
> error-gt1-soc-fatal-hbm-ss3-7 0x1000000000000044
>
> Cc: Alex Deucher <alexander.deucher at amd.com>
> Cc: Simona Vetter <simona at ffwll.ch>
> Cc: David Airlie <airlied at gmail.com>
> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
> Cc: Hawking Zhang <Hawking.Zhang at amd.com>
> Cc: Lijo Lazar <lijo.lazar at amd.com>
> Cc: Riana Tauro <riana.tauro at intel.com>
> Cc: Anshuman Gupta <anshuman.gupta at intel.com>
>
>
> Aravind Iddamsetty (1):
> tools/RAS: A tool to read error counters
>
> include/drm-uapi/drm_netlink.h | 105 ++++++++
> meson.build | 4 +
> tools/drm_ras.c | 428 +++++++++++++++++++++++++++++++++
> tools/meson.build | 5 +
> 4 files changed, 542 insertions(+)
> create mode 100644 include/drm-uapi/drm_netlink.h
> create mode 100644 tools/drm_ras.c
>
> --
> 2.25.1
>
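
For anyone scripting on top of the tool while the API is still under discussion, the READ_ALL and LIST_ERRORS listings quoted above are regular enough to parse line by line. The following is only a rough sketch in Python, not part of the proposed patch. The helper names (parse_line, split_config_id) are hypothetical. The split of config-id into a GT index in the top 4 bits and an error id in the low bits is an assumption inferred purely from the sample output (0x0... for gt0, 0x1... for gt1), and the uapi header in the series remains the authority on the real layout.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as "error-gt0-fatal-guc 0x0000000000000009 0",
# with or without a leading "> " quote marker. The counter column is
# optional so that LIST_ERRORS output parses as well.
LINE_RE = re.compile(r"^(?:>\s*)?(error-\S+)\s+(0x[0-9a-fA-F]+)(?:\s+(\d+))?\s*$")


class RasCounter(NamedTuple):
    name: str
    config_id: int
    counter: Optional[int]  # None when the line has no counter column


def parse_line(line: str) -> Optional[RasCounter]:
    """Parse one output line; return None for headers and blank lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    name, cfg, cnt = m.groups()
    return RasCounter(name, int(cfg, 16), int(cnt) if cnt is not None else None)


def split_config_id(config_id: int) -> tuple:
    """Return (gt_index, error_id).

    ASSUMPTION: the top 4 bits carry the GT index. This is inferred from
    the sample output in this thread only, not from the uapi header.
    """
    return config_id >> 60, config_id & ((1 << 60) - 1)


def nonzero_counters(lines):
    """Yield only the entries whose counter is present and non-zero."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.counter:
            yield entry
```

Feeding the captured output of "./drm_ras READ_ALL --device=..." through nonzero_counters() would leave only the counters that have actually fired, which is usually the interesting subset of a listing this long.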