[PATCH v3 0/7] Handle Firmware reported Hardware Errors
Riana Tauro
riana.tauro at intel.com
Wed Jul 2 14:11:10 UTC 2025
Add support to handle firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt.
The Device Source control registers indicate the source of the error as CSC.
The HEC error status register indicates that the error is firmware reported.
Depending on the type of firmware error, the error cause is written to the HEC
Firmware error register.
On encountering such CSC firmware errors, the graphics device is
wedged and the only way to recover from these errors is a firmware flash.
Add a vendor-specific recovery method to the drm device wedged uevent.
When a firmware flash is required, the device will enter runtime
survivability mode and send a drm device wedged uevent to notify userspace.
$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent
KERNEL[754.709341] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
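For illustration only, a userspace consumer could pick the recovery method out of such an event along these lines. This is a minimal sketch, not part of the series: the helper names (parse_uevent, wedged_methods) are hypothetical, the event text is assumed to arrive in the `udevadm monitor --property` layout shown above, and WEDGED is assumed to hold a comma-separated list of methods.

```python
def parse_uevent(text):
    """Collect KEY=VALUE properties from one udevadm --property event dump."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip the "KERNEL[...] change ..." header and anything without '='.
        if "=" not in line or line.startswith(("KERNEL[", "UDEV[")):
            continue
        key, _, value = line.partition("=")
        props[key] = value
    return props


def wedged_methods(props):
    """Return the methods listed in WEDGED, or [] if this is not a drm wedged event."""
    if props.get("SUBSYSTEM") != "drm" or "WEDGED" not in props:
        return []
    # Assumption: several methods may be listed, separated by commas.
    return [m for m in props["WEDGED"].split(",") if m]


# Sample event copied from the output above.
SAMPLE = """\
KERNEL[754.709341] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
"""

if __name__ == "__main__":
    props = parse_uevent(SAMPLE)
    if "vendor-specific" in wedged_methods(props):
        # What to do next is driver specific; for this series it is a firmware flash.
        print(props["DEVNAME"], "needs vendor-specific recovery")
```

What the consumer does after seeing vendor-specific is left to the vendor documentation; the sketch only shows how the property can be separated from the rest of the event.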
Bspec: 50875, 53073, 53074, 53075, 53076
IGT: https://patchwork.freedesktop.org/patch/660122/
Rev2: add a fault injection for csc errors
      fix review comments
Rev3: add a vendor-specific recovery method
      add support for runtime survivability mode
      enable runtime survivability mode when csc errors are reported
Riana Tauro (7):
  drm: Add a vendor-specific recovery method to device wedged uevent
  drm/xe: Set GT as wedged before sending wedged uevent
  drm/xe/xe_survivability: Add support for Runtime survivability mode
  drm/xe/doc: Document device wedged and runtime survivability
  drm/xe: Add support to handle hardware errors
  drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
  drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
Documentation/gpu/drm-uapi.rst | 5 +-
Documentation/gpu/xe/index.rst | 1 +
Documentation/gpu/xe/xe_device.rst | 10 +
Documentation/gpu/xe/xe_pcode.rst | 6 +-
drivers/gpu/drm/drm_drv.c | 2 +
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/regs/xe_gsc_regs.h | 2 +
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 20 ++
drivers/gpu/drm/xe/regs/xe_irq_regs.h | 1 +
drivers/gpu/drm/xe/xe_debugfs.c | 2 +
drivers/gpu/drm/xe/xe_device.c | 39 +++-
drivers/gpu/drm/xe/xe_device_types.h | 3 +
drivers/gpu/drm/xe/xe_hw_error.c | 187 ++++++++++++++++++
drivers/gpu/drm/xe/xe_hw_error.h | 15 ++
drivers/gpu/drm/xe/xe_irq.c | 4 +
drivers/gpu/drm/xe/xe_survivability_mode.c | 57 +++++-
drivers/gpu/drm/xe/xe_survivability_mode.h | 4 +-
.../gpu/drm/xe/xe_survivability_mode_types.h | 8 +
include/drm/drm_device.h | 4 +
19 files changed, 350 insertions(+), 21 deletions(-)
create mode 100644 Documentation/gpu/xe/xe_device.rst
create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
--
2.47.1