<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body>
<div dir="auto">Please ignore this. </div>
<div dir="auto">Will resend with the Subject fixed. </div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> Ghimiray, Himal Prasad &lt;himal.prasad.ghimiray@intel.com&gt;<br>
<b>Sent:</b> Wednesday, October 18, 2023 8:18:28 AM<br>
<b>To:</b> intel-xe@lists.freedesktop.org &lt;intel-xe@lists.freedesktop.org&gt;<br>
<b>Cc:</b> Ghimiray, Himal Prasad &lt;himal.prasad.ghimiray@intel.com&gt;<br>
<b>Subject:</b> [PATCH v8 00/10] *Supporting RAS on XE</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">Our platforms support Reliability, Availability and Serviceability (RAS).<br>
When a hardware error occurs, the hardware reports its cause via an<br>
interrupt or a PCIe error. Fatal errors are propagated as PCIe errors<br>
and non-fatal errors as MSI. This series focuses on logging these errors<br>
and updating counters for them, which will help to avoid, detect and<br>
repair hardware faults.<br>
<br>
Series [1] proposes a mechanism to expose these counters to userspace.<br>
[1]: <a href="https://patchwork.freedesktop.org/series/118435/">https://patchwork.freedesktop.org/series/118435/</a><br>
<br>
The error counters exposed by the KMD will be used by L0/sysman.<br>
They will be mapped to the specific error categories defined in sysman:<br>
<a href="https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras">https://spec.oneapi.io/level-zero/latest/sysman/api.html#ras</a><br>
<br>
We have very limited capabilities for error injection to validate the<br>
code flow.<br>
Output of an L3 fabric fatal error injection on PVC:<br>
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: TILE0 detected GT FATAL error bit[0] is set<br>
xe 0000:8c:00.0: [drm] *ERROR* [Hardware Error]: GT0 detected L3 FABRIC FATAL error. ERR_VECT_GT_FATAL[7]:0x00000087<br>
<br>
v2<br>
- Use different headers for error registers. (Nikula)<br>
- Correctable errors shouldn't be considered as dmesg errors (Matt)<br>
- Limit series to HW errors. (Aravind)<br>
<br>
v3<br>
- Rebase<br>
<br>
v4<br>
- Use xe_regs.h only for registers, move enums out of it.<br>
- Make sure global data/structures are immutable.<br>
- Avoid adding custom error logging macros.<br>
- Redesign the register error name and counter index<br>
structures for maintainability. (Nikula)<br>
<br>
v5<br>
- Move struct hw_err_regs out of CONFIG_DRM_XE_DISPLAY.<br>
<br>
v6<br>
- Address review comments from Aravind.<br>
<br>
v7<br>
- CI fixes.<br>
<br>
v8<br>
- Conditional check fixes.<br>
<br>
Himal Prasad Ghimiray (4):<br>
drm/xe: Handle errors from various components.<br>
drm/xe: Log and count the GT hardware errors.<br>
drm/xe: Support GT hardware error reporting for PVC.<br>
drm/xe: Support GSC hardware error reporting for PVC.<br>
drm/xe: Notify userspace about GSC HW errors.<br>
drm/xe: Support SOC FATAL error handling for PVC.<br>
drm/xe: Support SOC NONFATAL error handling for PVC.<br>
drm/xe: Handle MDFI error severity.<br>
drm/xe: Clear SOC CORRECTABLE error registers.<br>
drm/xe: Clear all SoC errors post warm reset.<br>
<br>
drivers/gpu/drm/xe/Makefile | 1 +<br>
drivers/gpu/drm/xe/regs/xe_gt_error_regs.h | 29 +<br>
drivers/gpu/drm/xe/regs/xe_regs.h | 5 +-<br>
drivers/gpu/drm/xe/regs/xe_tile_error_regs.h | 65 ++<br>
drivers/gpu/drm/xe/xe_device.c | 14 +<br>
drivers/gpu/drm/xe/xe_device_types.h | 21 +<br>
drivers/gpu/drm/xe/xe_gt.c | 1 +<br>
drivers/gpu/drm/xe/xe_gt_types.h | 7 +<br>
drivers/gpu/drm/xe/xe_hw_error.c | 919 +++++++++++++++++++<br>
drivers/gpu/drm/xe/xe_hw_error.h | 210 +++++<br>
drivers/gpu/drm/xe/xe_irq.c | 9 +<br>
drivers/gpu/drm/xe/xe_tile.c | 1 +<br>
include/uapi/drm/xe_drm.h | 8 +<br>
13 files changed, 1289 insertions(+), 1 deletion(-)<br>
create mode 100644 drivers/gpu/drm/xe/regs/xe_gt_error_regs.h<br>
create mode 100644 drivers/gpu/drm/xe/regs/xe_tile_error_regs.h<br>
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c<br>
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h<br>
<br>
-- <br>
2.25.1<br>
<br>
</div>
</span></font></div>
</body>
</html>