[PATCH v7 1/9] drm: Add a vendor-specific recovery method to drm device wedged uevent

Riana Tauro riana.tauro at intel.com
Thu Jul 31 13:48:19 UTC 2025



On 7/31/2025 6:31 PM, Maxime Ripard wrote:
> On Thu, Jul 31, 2025 at 04:43:46PM +0530, Riana Tauro wrote:
>> Hi Maxim
>>
>> On 7/31/2025 3:02 PM, Maxime Ripard wrote:
>>> Hi,
>>>
>>> On Wed, Jul 30, 2025 at 07:33:01PM +0530, Riana Tauro wrote:
>>>> On 7/28/2025 3:57 PM, Riana Tauro wrote:
>>>>> Address the need for a recovery method (firmware flash on Firmware errors)
>>>>> introduced in the later patches of Xe KMD.
>>>>> Whenever XE KMD detects a firmware error, a firmware flash is required to
>>>>> recover the device to normal operation.
>>>>>
>>>>> The initial proposal to use 'firmware-flash' as a recovery method was
>>>>> not applicable to other drivers and could cause multiple recovery
>>>>> methods specific to vendors to be added.
>>>>> To address this a more generic 'vendor-specific' method is introduced,
>>>>> guiding users to refer to vendor specific documentation and system logs
>>>>> for detailed vendor specific recovery procedure.
>>>>>
>>>>> Add a recovery method 'WEDGED=vendor-specific' for such errors.
>>>>> Vendors must provide additional recovery documentation if this method
>>>>> is used.
>>>>>
>>>>> It is the responsibility of the consumer to refer to the correct vendor
>>>>> specific documentation and usecase before attempting a recovery.
>>>>>
>>>>> For example: If driver is XE KMD, the consumer must refer
>>>>> to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
>>>>>
>>>>> Recovery script contributed by Raag.
>>>>>
>>>>> v2: fix documentation (Raag)
>>>>> v3: add more details to commit message (Sima, Rodrigo, Raag)
>>>>>        add an example script to the documentation (Raag)
>>>>> v4: use consistent naming (Raag)
>>>>> v5: fix commit message
>>>>>
>>>>> Cc: André Almeida <andrealmeid at igalia.com>
>>>>> Cc: Christian König <christian.koenig at amd.com>
>>>>> Cc: David Airlie <airlied at gmail.com>
>>>>> Cc: Simona Vetter <simona.vetter at ffwll.ch>
>>>>> Co-developed-by: Raag Jadav <raag.jadav at intel.com>
>>>>> Signed-off-by: Raag Jadav <raag.jadav at intel.com>
>>>>> Signed-off-by: Riana Tauro <riana.tauro at intel.com>
>>>>> Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
>>>>
>>>> This patch needs an ack from drm to be merged.
>>>> The rest of the series have RB's. Can someone please provide an ack ?
>>>>
>>>> Cc: drm-misc maintainers
>>>>
>>>> Thanks
>>>> Riana
>>>>
>>>>> ---
>>>>>     Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------
>>>>>     drivers/gpu/drm/drm_drv.c      |  2 ++
>>>>>     include/drm/drm_device.h       |  4 ++++
>>>>>     3 files changed, 41 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>>>>> index 843facf01b2d..5691b29acde3 100644
>>>>> --- a/Documentation/gpu/drm-uapi.rst
>>>>> +++ b/Documentation/gpu/drm-uapi.rst
>>>>> @@ -418,13 +418,15 @@ needed.
>>>>>     Recovery
>>>>>     --------
>>>>> -Current implementation defines three recovery methods, out of which, drivers
>>>>> +Current implementation defines four recovery methods, out of which, drivers
>>>>>     can use any one, multiple or none. Method(s) of choice will be sent in the
>>>>>     uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
>>>>> -more side-effects. If driver is unsure about recovery or method is unknown
>>>>> -(like soft/hard system reboot, firmware flashing, physical device replacement
>>>>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
>>>>> -will be sent instead.
>>>>> +more side-effects. If recovery method is specific to vendor
>>>>> +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
>>>>> +specific documentation for the recovery procedure. As an example if the driver
>>>>> +is 'Xe' then the documentation for 'Device Wedging' of Xe driver needs to be
>>>>> +referred for the recovery procedure. If driver is unsure about recovery or
>>>>> +method is unknown, ``WEDGED=unknown`` will be sent instead.
>>>>>     Userspace consumers can parse this event and attempt recovery as per the
>>>>>     following expectations.
>>>>> @@ -435,6 +437,7 @@ following expectations.
>>>>>         none            optional telemetry collection
>>>>>         rebind          unbind + bind driver
>>>>>         bus-reset       unbind + bus reset/re-enumeration + bind
>>>>> +    vendor-specific vendor specific recovery method
>>>>>         unknown         consumer policy
>>>>>         =============== ========================================
>>>>> @@ -472,8 +475,12 @@ erroring out, all device memory should be unmapped and file descriptors should
>>>>>     be closed to prevent leaks or undefined behaviour. The idea here is to clear the
>>>>>     device of all user context beforehand and set the stage for a clean recovery.
>>>>> -Example
>>>>> --------
>>>>> +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
>>>>> +consumer to check the driver documentation and the usecase before attempting
>>>>> +a recovery.
>>>>> +
>>>>> +Example - rebind
>>>>> +----------------
>>>>>     Udev rule::
>>>>> @@ -491,6 +498,27 @@ Recovery script::
>>>>>         echo -n $DEVICE > $DRIVER/unbind
>>>>>         echo -n $DEVICE > $DRIVER/bind
>>>>> +Example - vendor-specific
>>>>> +-------------------------
>>>>> +
>>>>> +Udev rule::
>>>>> +
>>>>> +    SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", DEVPATH=="*/drm/card[0-9]",
>>>>> +    RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}"
>>>>> +
>>>>> +Recovery script::
>>>>> +
>>>>> +    #!/bin/sh
>>>>> +
>>>>> +    DEVPATH=$(readlink -f /sys/$1/device)
>>>>> +    DRIVERPATH=$(readlink -f $DEVPATH/driver)
>>>>> +    DRIVER=$(basename $DRIVERPATH)
>>>>> +
>>>>> +    if [ "$DRIVER" = "xe" ]; then
>>>>> +        # Refer XE documentation and check usecase and recovery procedure
>>>>> +    fi
>>>>> +
>>>>> +
>>>
>>> So I guess I'm not opposed to it on principle, but the documentation
>>> really needs some work.
>>>
>>> You should at least list the valid vendor specific options, and what
>>> each mean exactly. Ideally, it should be a link to the datasheet/manual
>>> detailing the recovery procedure,
>>
>> This is added above
>>
>> "If recovery method is specific to vendor ``WEDGED=vendor-specific`` will be
>> sent and userspace should refer to vendor specific documentation for the
>> recovery procedure. As an example if the driver is 'Xe' then the
>> documentation for 'Device Wedging' of Xe driver needs to be referred for the
>> recovery procedure."
>>
>> The documentation of Xe is in Patch 6
>>
>> https://lore.kernel.org/intel-xe/20250728102809.502324-7-riana.tauro@intel.com/
> 
> I'm sorry, I still don't get how, as a user, I can reimplement what that
> tool is supposed to be doing. Or do you anticipate that there's only
> ever be a single way to recover a Xe device, which is to reflash the
> firmware?

As suggested, it is a combination of uevent with an additional vendor 
specific indicator. In this case, along with the uevent there is a sysfs 
and dmesg that indicates firmware flash is the required

In the future, if there is another vendor-specific recovery needed then 
same will be indicated in dmesg/sysfs or any vendor specific way and the 
case will be documented

> 
> What if in ~5y, Intel comes up with a new recovery method for the newer
> models?

That will be documented in the Xe Device Wedging documentation as 
mentioned above

> 
>> I'll add the link instead of just the chapter name
>>> but if that's under NDA, at least a
>>> reference to the document and section you need to look at to implement
>>> it properly.
>>>
>>> Or if that's still not doable, anything that tells you what to do
>>> instead of "run a shell script we don't provide".
>>>
>>> Also, we just discussed it with Sima on IRC, and she mentioned that we
>>> probably want to have a vendor specific prefix for each vendor-specific
>>> method.
>>
>> This was discussed as part of Rev4
>>
>> https://lore.kernel.org/intel-xe/aG-U9JTXDah_tu1U@black.fi.intel.com/
>>
>> DEVPATH from uevent and driver should be able to identify the driver.
>> Shouldn't that be enough?
> 
> See above. What happens if we start to see systems with two Xe GPUs, one
> with a new recovery method and one with an old recovery method?


The DEV_PATH will indicate the device and the vendor-specific indicator 
like dmesg/sysfs will indicate the recovery procedure

Thanks
Riana

> 
> Maxime



More information about the Intel-xe mailing list