[PATCH v7 1/9] drm: Add a vendor-specific recovery method to drm device wedged uevent

Maxime Ripard mripard at kernel.org
Thu Jul 31 09:32:41 UTC 2025


Hi,

On Wed, Jul 30, 2025 at 07:33:01PM +0530, Riana Tauro wrote:
> On 7/28/2025 3:57 PM, Riana Tauro wrote:
> > Address the need for a recovery method (firmware flash on Firmware errors)
> > introduced in the later patches of Xe KMD.
> > Whenever XE KMD detects a firmware error, a firmware flash is required to
> > recover the device to normal operation.
> > 
> > The initial proposal to use 'firmware-flash' as a recovery method was
> > not applicable to other drivers and could cause multiple recovery
> > methods specific to vendors to be added.
> > To address this a more generic 'vendor-specific' method is introduced,
> > guiding users to refer to vendor specific documentation and system logs
> > for detailed vendor specific recovery procedure.
> > 
> > Add a recovery method 'WEDGED=vendor-specific' for such errors.
> > Vendors must provide additional recovery documentation if this method
> > is used.
> > 
> > It is the responsibility of the consumer to refer to the correct vendor
> > specific documentation and usecase before attempting a recovery.
> > 
> > For example: If driver is XE KMD, the consumer must refer
> > to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
> > 
> > Recovery script contributed by Raag.
> > 
> > v2: fix documentation (Raag)
> > v3: add more details to commit message (Sima, Rodrigo, Raag)
> >      add an example script to the documentation (Raag)
> > v4: use consistent naming (Raag)
> > v5: fix commit message
> > 
> > Cc: André Almeida <andrealmeid at igalia.com>
> > Cc: Christian König <christian.koenig at amd.com>
> > Cc: David Airlie <airlied at gmail.com>
> > Cc: Simona Vetter <simona.vetter at ffwll.ch>
> > Co-developed-by: Raag Jadav <raag.jadav at intel.com>
> > Signed-off-by: Raag Jadav <raag.jadav at intel.com>
> > Signed-off-by: Riana Tauro <riana.tauro at intel.com>
> > Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> 
> This patch needs an ack from drm to be merged.
> The rest of the series have RB's. Can someone please provide an ack ?
> 
> Cc: drm-misc maintainers
> 
> Thanks
> Riana
> 
> > ---
> >   Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------
> >   drivers/gpu/drm/drm_drv.c      |  2 ++
> >   include/drm/drm_device.h       |  4 ++++
> >   3 files changed, 41 insertions(+), 7 deletions(-)
> > 
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 843facf01b2d..5691b29acde3 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -418,13 +418,15 @@ needed.
> >   Recovery
> >   --------
> > -Current implementation defines three recovery methods, out of which, drivers
> > +Current implementation defines four recovery methods, out of which, drivers
> >   can use any one, multiple or none. Method(s) of choice will be sent in the
> >   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > -more side-effects. If driver is unsure about recovery or method is unknown
> > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > -will be sent instead.
> > +more side-effects. If recovery method is specific to vendor
> > +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > +specific documentation for the recovery procedure. As an example if the driver
> > +is 'Xe' then the documentation for 'Device Wedging' of Xe driver needs to be
> > +referred for the recovery procedure. If driver is unsure about recovery or
> > +method is unknown, ``WEDGED=unknown`` will be sent instead.
> >   Userspace consumers can parse this event and attempt recovery as per the
> >   following expectations.
> > @@ -435,6 +437,7 @@ following expectations.
> >       none            optional telemetry collection
> >       rebind          unbind + bind driver
> >       bus-reset       unbind + bus reset/re-enumeration + bind
> > +    vendor-specific vendor specific recovery method
> >       unknown         consumer policy
> >       =============== ========================================
> > @@ -472,8 +475,12 @@ erroring out, all device memory should be unmapped and file descriptors should
> >   be closed to prevent leaks or undefined behaviour. The idea here is to clear the
> >   device of all user context beforehand and set the stage for a clean recovery.
> > -Example
> > --------
> > +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
> > +consumer to check the driver documentation and the usecase before attempting
> > +a recovery.
> > +
> > +Example - rebind
> > +----------------
> >   Udev rule::
> > @@ -491,6 +498,27 @@ Recovery script::
> >       echo -n $DEVICE > $DRIVER/unbind
> >       echo -n $DEVICE > $DRIVER/bind
> > +Example - vendor-specific
> > +-------------------------
> > +
> > +Udev rule::
> > +
> > +    SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", DEVPATH=="*/drm/card[0-9]",
> > +    RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}"
> > +
> > +Recovery script::
> > +
> > +    #!/bin/sh
> > +
> > +    DEVPATH=$(readlink -f /sys/$1/device)
> > +    DRIVERPATH=$(readlink -f $DEVPATH/driver)
> > +    DRIVER=$(basename $DRIVERPATH)
> > +
> > +    if [ "$DRIVER" = "xe" ]; then
> > +        # Refer XE documentation and check usecase and recovery procedure
> > +    fi
> > +
> > +

So I guess I'm not opposed to it on principle, but the documentation
really needs some work.

You should at least list the valid vendor specific options, and what
each mean exactly. Ideally, it should be a link to the datasheet/manual
detailing the recovery procedure, but if that's under NDA, at least a
reference to the document and section you need to look at to implement
it properly.

Or if that's still not doable, anything that tells you what to do
instead of "run a shell script we don't provide".

Also, we just discussed it with Sima on IRC, and she mentioned that we
probably want to have a vendor specific prefix for each vendor-specific
method.

Maxime
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 273 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/dri-devel/attachments/20250731/47f1cf03/attachment.sig>


More information about the dri-devel mailing list