[Intel-xe] [PATCH v3 0/4] Supporting RAS on XE
Himal Prasad Ghimiray
himal.prasad.ghimiray at intel.com
Thu Aug 10 07:26:27 UTC 2023
Our platforms support Reliability, Availability and Serviceability (RAS).
When a hardware error occurs, the hardware reports the cause by raising
an interrupt or a PCIe error. Fatal errors are propagated as PCIe errors
and non-fatal errors as MSI. This series focuses on logging these errors
and updating counters for them, which will help to avoid, detect and
repair hardware faults.
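As an illustration of the "log and count" idea, here is a minimal userspace sketch in C. It is not the code from this series: the enum, the struct and the function names are hypothetical, the single counter per severity is an assumption made for brevity, and fprintf() stands in for the kernel's logging helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical severity classes, named for this sketch only. */
enum hw_err_severity {
	HW_ERR_CORRECTABLE,
	HW_ERR_NONFATAL,
	HW_ERR_FATAL,
	HW_ERR_SEVERITY_COUNT
};

/* One counter per severity; a real driver would likely keep finer-grained ones. */
struct hw_err_counters {
	uint64_t count[HW_ERR_SEVERITY_COUNT];
};

static const char *hw_err_severity_str(enum hw_err_severity sev)
{
	switch (sev) {
	case HW_ERR_CORRECTABLE: return "CORRECTABLE";
	case HW_ERR_NONFATAL:    return "NONFATAL";
	case HW_ERR_FATAL:       return "FATAL";
	default:                 return "UNKNOWN";
	}
}

/* Log one error event and bump its counter; returns the new count. */
static uint64_t hw_err_record(struct hw_err_counters *c, enum hw_err_severity sev)
{
	assert(c && sev < HW_ERR_SEVERITY_COUNT);
	c->count[sev]++;
	fprintf(stderr, "[Hardware Error]: %s error, count=%llu\n",
		hw_err_severity_str(sev), (unsigned long long)c->count[sev]);
	return c->count[sev];
}
```

A caller would invoke hw_err_record() once per reported event from whichever path delivered it (interrupt or PCIe error), so the counters stay consistent regardless of the delivery mechanism.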
The series at [1] proposes a mechanism to expose these counters to userspace.
[1]: https://patchwork.freedesktop.org/series/118435/
The error counters exposed by the KMD will be used by L0/sysman.
They will be mapped to the specific error categories defined in sysman:
https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras
The current version covers error propagation at the tile level and the
possible causes from the GT subsystem. Future patches will cover the
possible causes from other subsystems (for example SOC, GSC/CSC).
We have very limited capabilities for error injection to validate the
code flow.
Output of an L3 fabric fatal error injection on PVC:
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: TILE0 detected GT FATAL error bit[0] is set
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected L3 FABRIC FATAL error. ERR_VECT_GT_FATAL_[7]:0x00000087
v2
- Use different headers for error registers. (Nikula)
- Correctable errors shouldn't be considered as dmesg errors (Matt)
- Limit series to HW errors. (Aravind)
v3
- Rebase
Himal Prasad Ghimiray (4):
drm/xe: Handle errors from various components.
drm/xe: Log and count the GT hardware errors.
drm/xe: Support GT hardware error reporting for PVC.
drm/xe: Process fatal hardware errors.
drivers/gpu/drm/xe/regs/xe_gt_error_regs.h | 99 ++++
drivers/gpu/drm/xe/regs/xe_regs.h | 7 +
drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 108 +++++
drivers/gpu/drm/xe/xe_device_types.h | 6 +
drivers/gpu/drm/xe/xe_gt_types.h | 6 +
drivers/gpu/drm/xe/xe_irq.c | 460 +++++++++++++++++++
6 files changed, 686 insertions(+)
create mode 100644 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h
create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h
--
2.25.1