[PATCH v5 0/9] Handle Firmware reported Hardware Errors
Riana Tauro
riana.tauro at intel.com
Tue Jul 15 10:47:20 UTC 2025
Add support to handle firmware reported errors. When CSC firmware
errors are encountered, an error interrupt is received by the GFX device as
an MSI interrupt. The Device Source control registers indicate the source of the error
as CSC. The HEC error status register indicates that the error is firmware reported.
Depending on the type of firmware error, the error cause is written to the HEC
Firmware error register. On encountering such CSC firmware errors, the device is unusable
and can be recovered only using a firmware update.
Whenever XE KMD detects such a firmware error, the
system administrator/userspace needs to be notified to trigger
a firmware update. To address this need, the drm wedged uevent with a new
recovery method and runtime survivability mode are used.
The initial proposal to add 'firmware-flash' as a recovery method was
not applicable to other drivers and could cause multiple recovery
methods specific to vendors to be added.
A more generic 'vendor-specific' method is introduced in this series,
guiding users to refer to vendor specific documentation, system
logs and additional indicators for the detailed vendor specific recovery mechanism.
It is the responsibility of the consumer to refer to the correct vendor
specific documentation and use case before attempting a recovery.
For example: If the driver is XE KMD, the consumer must refer
to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
The necessity of a firmware flash in Xe KMD is notified to the user with a
combination of the vendor-specific wedged uevent, runtime survivability
mode and dmesg logs. The consumer must check both the uevent and the runtime
survivability sysfs before triggering a firmware update.
Udev:
$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent
KERNEL[754.709341] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=vendor-specific
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5973
MAJOR=226
MINOR=0
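
A userspace consumer could pick the recovery method out of an event like the
one above roughly as follows. This is a minimal sketch and not part of the
series: the helper names are hypothetical, and it assumes WEDGED carries a
comma-separated list of recovery methods, as the drm wedged uevent does for
its existing methods.

```python
def parse_uevent_properties(text):
    """Collect KEY=VALUE lines from `udevadm monitor --property` output."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip the "KERNEL[...]" header and any line that is not KEY=VALUE.
        if "=" not in line or line.startswith("KERNEL["):
            continue
        key, _, value = line.partition("=")
        props[key] = value
    return props


def wedged_methods(props):
    """Return the recovery methods listed in WEDGED (empty list if absent)."""
    value = props.get("WEDGED", "")
    return [m.strip() for m in value.split(",") if m.strip()]


def needs_vendor_specific_recovery(props):
    """True for a drm uevent that advertises the vendor-specific method."""
    return (props.get("SUBSYSTEM") == "drm"
            and "vendor-specific" in wedged_methods(props))
```

A match here only says that vendor documentation has to be consulted; for Xe
it is not sufficient on its own to justify a firmware update.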
Dmesg:
xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: Tile0 reported NONFATAL error 0x20000
xe 0000:03:00.0: [drm] *ERROR* [Hardware Error]: NONFATAL: HEC Uncorrected FW FD Corruption error reported, bit[2] is set
xe 0000:03:00.0: Runtime Survivability mode enabled
xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.
IOCTLs and executions are blocked. Only a rebind may clear the failure
Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new
xe 0000:03:00.0: [drm] device wedged, needs recovery
xe 0000:03:00.0: Firmware update required, Please refer to the userspace documentation for more details!
Runtime survivability Sysfs:
/sys/bus/pci/devices/<device>/survivability_mode
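
The sysfs check can be tied to the DEVPATH of the uevent. The sketch below is
illustrative only: the function names are hypothetical, it assumes the PCI
device is the path component directly before "drm" in DEVPATH, and it treats
the mere presence of the attribute as the indicator, since this cover letter
does not describe the file's contents.

```python
import os


def pci_address_from_devpath(devpath):
    """Return the path component just before 'drm', e.g. '0000:03:00.0'."""
    parts = devpath.strip("/").split("/")
    if "drm" not in parts:
        return None
    idx = parts.index("drm")
    return parts[idx - 1] if idx > 0 else None


def survivability_mode_path(devpath, sysfs_root="/sys"):
    """Build the survivability_mode attribute path for the uevent's device."""
    addr = pci_address_from_devpath(devpath)
    if addr is None:
        return None
    return os.path.join(sysfs_root, "bus", "pci", "devices", addr,
                        "survivability_mode")


def survivability_mode_present(devpath, sysfs_root="/sys"):
    """Assumption: the attribute exists only while survivability mode is on."""
    path = survivability_mode_path(devpath, sysfs_root)
    return path is not None and os.path.exists(path)
```

A consumer would trigger a firmware update only when this check succeeds for
the same device that reported WEDGED=vendor-specific.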
Bspec: 50875, 53073, 53074, 53075, 53076
IGT: https://patchwork.freedesktop.org/patch/660122/
fwupd PR: https://github.com/fwupd/fwupd/pull/9024
Rev2: add a fault injection for csc errors
fix review comments
Rev3: add a vendor-specific recovery method
add support for runtime survivability mode
enable runtime survivability mode when csc errors are reported
Rev4: refactor survivability code
Rev5: Add more documentation
add user friendly logs
remove checks for BMG if not necessary
fix other review comments
Riana Tauro (9):
drm: Add a vendor-specific recovery method to device wedged uevent
drm/xe: Set GT as wedged before sending wedged uevent
drm/xe: Add a helper function to set recovery method
drm/xe/xe_survivability: Refactor survivability mode
drm/xe/xe_survivability: Add support for Runtime survivability mode
drm/xe/doc: Document device wedged and runtime survivability
drm/xe: Add support to handle hardware errors
drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors
drm/xe/xe_hw_error: Add fault injection to trigger csc error handler
Documentation/gpu/drm-uapi.rst | 41 +++-
Documentation/gpu/xe/index.rst | 1 +
Documentation/gpu/xe/xe_device.rst | 10 +
Documentation/gpu/xe/xe_pcode.rst | 6 +-
drivers/gpu/drm/drm_drv.c | 2 +
drivers/gpu/drm/xe/Makefile | 1 +
drivers/gpu/drm/xe/regs/xe_gsc_regs.h | 2 +
drivers/gpu/drm/xe/regs/xe_hw_error_regs.h | 20 ++
drivers/gpu/drm/xe/regs/xe_irq_regs.h | 1 +
drivers/gpu/drm/xe/xe_debugfs.c | 3 +
drivers/gpu/drm/xe/xe_device.c | 58 +++++-
drivers/gpu/drm/xe/xe_device.h | 1 +
drivers/gpu/drm/xe/xe_device_types.h | 5 +
drivers/gpu/drm/xe/xe_heci_gsc.c | 2 +-
drivers/gpu/drm/xe/xe_hw_error.c | 181 ++++++++++++++++++
drivers/gpu/drm/xe/xe_hw_error.h | 15 ++
drivers/gpu/drm/xe/xe_irq.c | 4 +
drivers/gpu/drm/xe/xe_pci.c | 6 +-
drivers/gpu/drm/xe/xe_survivability_mode.c | 167 ++++++++++++----
drivers/gpu/drm/xe/xe_survivability_mode.h | 5 +-
.../gpu/drm/xe/xe_survivability_mode_types.h | 8 +
include/drm/drm_device.h | 4 +
22 files changed, 488 insertions(+), 55 deletions(-)
create mode 100644 Documentation/gpu/xe/xe_device.rst
create mode 100644 drivers/gpu/drm/xe/regs/xe_hw_error_regs.h
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.c
create mode 100644 drivers/gpu/drm/xe/xe_hw_error.h
--
2.47.1