[PATCH v7 1/9] drm: Add a vendor-specific recovery method to drm device wedged uevent

Riana Tauro riana.tauro at intel.com
Fri Aug 8 08:02:52 UTC 2025


Hi Maxime/Sima

On 8/5/2025 8:27 PM, Rodrigo Vivi wrote:
> On Mon, Jul 28, 2025 at 03:57:51PM +0530, Riana Tauro wrote:
>> Address the need for a recovery method (firmware flash on Firmware errors)
>> introduced in the later patches of Xe KMD.
>> Whenever XE KMD detects a firmware error, a firmware flash is required to
>> recover the device to normal operation.
>>
>> The initial proposal to use 'firmware-flash' as a recovery method was
>> not applicable to other drivers and could cause multiple recovery
>> methods specific to vendors to be added.
>> To address this a more generic 'vendor-specific' method is introduced,
>> guiding users to refer to vendor specific documentation and system logs
>> for detailed vendor specific recovery procedure.
>>
>> Add a recovery method 'WEDGED=vendor-specific' for such errors.
>> Vendors must provide additional recovery documentation if this method
>> is used.
>>
>> It is the responsibility of the consumer to refer to the correct vendor
>> specific documentation and usecase before attempting a recovery.
>>
>> For example: If driver is XE KMD, the consumer must refer
>> to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
>>
>> Recovery script contributed by Raag.
>>
>> v2: fix documentation (Raag)
>> v3: add more details to commit message (Sima, Rodrigo, Raag)
>>      add an example script to the documentation (Raag)
>> v4: use consistent naming (Raag)
>> v5: fix commit message
>>
>> Cc: André Almeida <andrealmeid at igalia.com>
>> Cc: Christian König <christian.koenig at amd.com>
>> Cc: David Airlie <airlied at gmail.com>
>> Cc: Simona Vetter <simona.vetter at ffwll.ch>
> 
> Cc: Maxime Ripard <mripard at kernel.org>
> 
>> Co-developed-by: Raag Jadav <raag.jadav at intel.com>
>> Signed-off-by: Raag Jadav <raag.jadav at intel.com>
>> Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>> Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> ---
>>   Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------
>>   drivers/gpu/drm/drm_drv.c      |  2 ++
>>   include/drm/drm_device.h       |  4 ++++
>>   3 files changed, 41 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 843facf01b2d..5691b29acde3 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -418,13 +418,15 @@ needed.
>>   Recovery
>>   --------
>>   
>> -Current implementation defines three recovery methods, out of which, drivers
>> +Current implementation defines four recovery methods, out of which, drivers
>>   can use any one, multiple or none. Method(s) of choice will be sent in the
>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>> -more side-effects. If driver is unsure about recovery or method is unknown
>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>> -will be sent instead.
>> +more side-effects. If the recovery method is specific to the vendor,
>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to the
>> +vendor-specific documentation for the recovery procedure. For example, if the
>> +driver is 'Xe', consult the 'Device Wedging' documentation of the Xe driver
>> +for the recovery procedure. If the driver is unsure about recovery or the
>> +method is unknown, ``WEDGED=unknown`` will be sent instead.
> 
> What if instead of this we do something like:
> 
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -441,6 +441,29 @@ following expectations.
>       unknown         consumer policy
>       =============== ========================================
>   
> +Vendor-Specific Recovery
> +++++++++++++++++++++++++
> +
> +When ``WEDGED=vendor-specific`` is emitted, it indicates that the device requires a
> +recovery method that is *not standardized* and is specific to the hardware vendor.
> +
> +In this case, the vendor driver must provide detailed documentation describing
> +every possible recovery method and its procedure. It needs to include:
> +
> +- Hints: Which of the following will be used to identify the
> +  specific device, and guide the administrator:
> +  + Sysfs, debugfs, tracepoints, or kernel logs (e.g., ``dmesg``)
> +- Explicit guidance: for any admin or userspace tools and scripts necessary
> +  to carry out recovery.
> +
> +**Example**:
> +    If the device uses the ``Xe`` driver, then administrators should consult the
> +    *"Device Wedging"* section of the Xe driver's documentation to determine
> +    the proper steps for recovery.
> +
> +Notes
> ++++++
> +
>   The only exception to this is ``WEDGED=none``, which signifies that the device
> 
> ----------------------
> 
> Maxime, is it any better?

Is the documentation suggested by Rodrigo okay? Any suggestions?

Thanks
Riana
> 
> Thanks,
> Rodrigo.
> 
>>   
>>   Userspace consumers can parse this event and attempt recovery as per the
>>   following expectations.
>> @@ -435,6 +437,7 @@ following expectations.
>>       none            optional telemetry collection
>>       rebind          unbind + bind driver
>>       bus-reset       unbind + bus reset/re-enumeration + bind
>> +    vendor-specific vendor specific recovery method
>>       unknown         consumer policy
>>       =============== ========================================
>>   
>> @@ -472,8 +475,12 @@ erroring out, all device memory should be unmapped and file descriptors should
>>   be closed to prevent leaks or undefined behaviour. The idea here is to clear the
>>   device of all user context beforehand and set the stage for a clean recovery.
>>   
>> -Example
>> --------
>> +For the ``WEDGED=vendor-specific`` recovery method, it is the responsibility of
>> +the consumer to check the driver documentation and the use case before
>> +attempting a recovery.
>> +
>> +Example - rebind
>> +----------------
>>   
>>   Udev rule::
>>   
>> @@ -491,6 +498,27 @@ Recovery script::
>>       echo -n $DEVICE > $DRIVER/unbind
>>       echo -n $DEVICE > $DRIVER/bind
>>   
>> +Example - vendor-specific
>> +-------------------------
>> +
>> +Udev rule::
>> +
>> +    SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", DEVPATH=="*/drm/card[0-9]",
>> +    RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}"
>> +
>> +Recovery script::
>> +
>> +    #!/bin/sh
>> +
>> +    DEVPATH=$(readlink -f /sys/$1/device)
>> +    DRIVERPATH=$(readlink -f $DEVPATH/driver)
>> +    DRIVER=$(basename $DRIVERPATH)
>> +
>> +    if [ "$DRIVER" = "xe" ]; then
>> +        : # Refer to the Xe documentation; check the use case and recovery procedure
>> +    fi
>> +
>> +
>>   Customization
>>   -------------
>>   
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index cdd591b11488..0ac723a46a91 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
>>   		return "rebind";
>>   	case DRM_WEDGE_RECOVERY_BUS_RESET:
>>   		return "bus-reset";
>> +	case DRM_WEDGE_RECOVERY_VENDOR:
>> +		return "vendor-specific";
>>   	default:
>>   		return NULL;
>>   	}
>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
>> index a33aedd5e9ec..59fd3f4d5995 100644
>> --- a/include/drm/drm_device.h
>> +++ b/include/drm/drm_device.h
>> @@ -26,10 +26,14 @@ struct pci_controller;
>>    * Recovery methods for wedged device in order of less to more side-effects.
>>    * To be used with drm_dev_wedged_event() as recovery @method. Callers can
>>    * use any one, multiple (or'd) or none depending on their needs.
>> + *
>> + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
>> + * details.
>>    */
>>   #define DRM_WEDGE_RECOVERY_NONE		BIT(0)	/* optional telemetry collection */
>>   #define DRM_WEDGE_RECOVERY_REBIND	BIT(1)	/* unbind + bind driver */
>>   #define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(2)	/* unbind + reset bus device + bind */
>> +#define DRM_WEDGE_RECOVERY_VENDOR	BIT(3)	/* vendor specific recovery method */
>>   
>>   /**
>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
>> -- 
>> 2.47.1
>>
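
For illustration only, and not part of the patch: the documentation above says
the uevent carries ``WEDGED=<method1>[,..,<methodN>]`` in order of less to more
side-effects. A consumer could walk that list and take the first method it
knows how to handle, roughly as sketched below. The function name
pick_wedge_method is hypothetical, and treating 'unknown' (or any unrecognised
value) as "no method picked" is an assumption about consumer policy, not
something the patch mandates.

```shell
#!/bin/sh

# Hypothetical helper: print the first method in a comma-separated WEDGED
# value that this consumer recognises. Returns non-zero when nothing in the
# list is recognised (for example WEDGED=unknown), leaving the decision to
# consumer policy.
pick_wedge_method() {
    list=$1
    while [ -n "$list" ]; do
        # Take everything before the first comma as the current method.
        method=${list%%,*}
        # Drop the current method from the list, or empty it if it was last.
        case "$list" in
            *,*) list=${list#*,} ;;
            *)   list= ;;
        esac
        case "$method" in
            none|rebind|bus-reset|vendor-specific)
                printf '%s\n' "$method"
                return 0
                ;;
        esac
    done
    return 1
}
```

A udev-invoked script could call it as `pick_wedge_method "$WEDGED"` and branch
on the result, falling back to logging only when it returns non-zero.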


