[PATCH 00/10] Supporting RAS on XE

Aravind Iddamsetty aravind.iddamsetty at linux.intel.com
Wed Jul 30 05:48:04 UTC 2025


Rebasing this series in preparation for the netlink infrastructure series,
which will be posted as a follow-up.

Our platforms support Reliability, Availability and Serviceability (RAS).
When a hardware error occurs, the hardware reports the cause by sending
an interrupt or a PCIe error. Fatal errors are propagated as PCI errors
and non-fatal errors as MSI. This series focuses on logging and updating
counters for these errors, which will help to detect and repair hardware
faults.
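
As a rough illustration of the "log and count" idea above, here is a minimal
userspace C sketch. It is not code from this series: every name in it
(hw_err_severity, hw_err_counters, hw_err_account) is hypothetical, and the
three severities are simply taken from the CORRECTABLE/NONFATAL/FATAL wording
in the patch titles below.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical severity classes, named after the patch titles. */
enum hw_err_severity {
	HW_ERR_CORRECTABLE,
	HW_ERR_NONFATAL,
	HW_ERR_FATAL,
	HW_ERR_SEVERITY_COUNT,
};

/* Hypothetical counter block: one running total per severity. */
struct hw_err_counters {
	uint64_t count[HW_ERR_SEVERITY_COUNT];
};

static const char *hw_err_severity_str(enum hw_err_severity sev)
{
	switch (sev) {
	case HW_ERR_CORRECTABLE:
		return "CORRECTABLE";
	case HW_ERR_NONFATAL:
		return "NONFATAL";
	case HW_ERR_FATAL:
		return "FATAL";
	default:
		return "UNKNOWN";
	}
}

/*
 * Log one reported error and bump its counter.
 * Returns the new total for that severity, or 0 if sev is out of range.
 */
static uint64_t hw_err_account(struct hw_err_counters *c,
			       enum hw_err_severity sev)
{
	if ((unsigned int)sev >= HW_ERR_SEVERITY_COUNT)
		return 0;

	c->count[sev]++;
	fprintf(stderr, "[Hardware Error]: %s error, total %" PRIu64 "\n",
		hw_err_severity_str(sev), c->count[sev]);
	return c->count[sev];
}
```

A caller would zero-initialize one struct hw_err_counters and call
hw_err_account() once per reported error. The counters in the actual series
may well be finer-grained than one total per severity.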

The series at [1] proposes a mechanism to expose these counters to userspace.
[1]: https://patchwork.freedesktop.org/series/118435/

The error counters exposed by the KMD will be used by L0/sysman, where
they will be mapped to the specific error categories defined by sysman:
https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras

We have very limited error injection capabilities to validate the
code flow.
The output of an L3 fabric fatal error injection on PVC is:
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: TILE0 detected GT FATAL
error bit[0] is set
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected L3 FABRIC
FATAL error. ERR_VECT_GT_FATAL[7]:0x00000087
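
For readers who want to decode a logged vector value such as 0x00000087 by
hand, a small userspace C sketch follows. The helper name err_vect_set_bits is
hypothetical and not part of this series; it only lists which bit positions
are set and does not assign any meaning to them. The meaning of each bit comes
from the register definitions added by the patches.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the indices of the set bits in a 32-bit error vector value,
 * lowest bit first. Writes at most max entries into bits[] and returns
 * the number of entries written.
 */
static size_t err_vect_set_bits(uint32_t val, unsigned int *bits, size_t max)
{
	size_t n = 0;

	for (unsigned int bit = 0; bit < 32 && n < max; bit++) {
		if (val & (UINT32_C(1) << bit))
			bits[n++] = bit;
	}
	return n;
}
```

Passing the value from the log above, 0x00000087, yields four set bits:
0, 1, 2 and 7.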

Cc: Riana Tauro <riana.tauro at intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray at intel.com>
Cc: Anshuman Gupta <anshuman.gupta at intel.com>

Himal Prasad Ghimiray (10):
  drm/xe: Handle errors from various components.
  drm/xe: Add new helpers to log hardware errors.
  drm/xe: Log and count the GT hardware errors.
  drm/xe: Support GT hardware error reporting for PVC.
  drm/xe: Support GSC hardware error reporting for PVC.
  drm/xe: Support SOC FATAL error handling for PVC.
  drm/xe: Support SOC NONFATAL error handling for PVC.
  drm/xe: Handle MDFI error severity.
  drm/xe: Clear SOC CORRECTABLE error registers.
  drm/xe: Clear all SoC errors post warm reset.

 drivers/gpu/drm/xe/Makefile                  |   1 +
 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h   |  29 +
 drivers/gpu/drm/xe/regs/xe_irq_regs.h        |   1 +
 drivers/gpu/drm/xe/regs/xe_regs.h            |   3 +
 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h |  66 ++
 drivers/gpu/drm/xe/xe_device.c               |  16 +
 drivers/gpu/drm/xe/xe_device_types.h         |  17 +
 drivers/gpu/drm/xe/xe_gt.c                   |   1 +
 drivers/gpu/drm/xe/xe_gt_printk.h            |   7 +
 drivers/gpu/drm/xe/xe_gt_types.h             |   6 +
 drivers/gpu/drm/xe/xe_hw_error.c             | 896 +++++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h             | 194 ++++
 drivers/gpu/drm/xe/xe_irq.c                  |   1 +
 drivers/gpu/drm/xe/xe_tile.c                 |   2 +
 14 files changed, 1240 insertions(+)
 create mode 100644 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h

-- 
2.25.1


