[PATCH v7 1/9] drm: Add a vendor-specific recovery method to drm device wedged uevent
Maxime Ripard
mripard at kernel.org
Tue Aug 19 11:21:29 UTC 2025
On Tue, Aug 05, 2025 at 10:57:24AM -0400, Rodrigo Vivi wrote:
> On Mon, Jul 28, 2025 at 03:57:51PM +0530, Riana Tauro wrote:
> > Address the need for a recovery method (firmware flash on firmware errors)
> > introduced in the later patches of Xe KMD.
> > Whenever Xe KMD detects a firmware error, a firmware flash is required to
> > recover the device to normal operation.
> >
> > The initial proposal to use 'firmware-flash' as a recovery method was
> > not applicable to other drivers and could cause multiple recovery
> > methods specific to vendors to be added.
> > To address this, a more generic 'vendor-specific' method is introduced,
> > guiding users to refer to vendor-specific documentation and system logs
> > for the detailed recovery procedure.
> >
> > Add a recovery method 'WEDGED=vendor-specific' for such errors.
> > Vendors must provide additional recovery documentation if this method
> > is used.
> >
> > It is the responsibility of the consumer to refer to the correct
> > vendor-specific documentation and use case before attempting a recovery.
> >
> > For example, if the driver is Xe KMD, the consumer must refer
> > to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
> >
> > Recovery script contributed by Raag.
> >
> > v2: fix documentation (Raag)
> > v3: add more details to commit message (Sima, Rodrigo, Raag)
> > add an example script to the documentation (Raag)
> > v4: use consistent naming (Raag)
> > v5: fix commit message
> >
> > Cc: André Almeida <andrealmeid at igalia.com>
> > Cc: Christian König <christian.koenig at amd.com>
> > Cc: David Airlie <airlied at gmail.com>
> > Cc: Simona Vetter <simona.vetter at ffwll.ch>
>
> Cc: Maxime Ripard <mripard at kernel.org>
>
> > Co-developed-by: Raag Jadav <raag.jadav at intel.com>
> > Signed-off-by: Raag Jadav <raag.jadav at intel.com>
> > Signed-off-by: Riana Tauro <riana.tauro at intel.com>
> > Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > ---
> > Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------
> > drivers/gpu/drm/drm_drv.c | 2 ++
> > include/drm/drm_device.h | 4 ++++
> > 3 files changed, 41 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 843facf01b2d..5691b29acde3 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -418,13 +418,15 @@ needed.
> > Recovery
> > --------
> >
> > -Current implementation defines three recovery methods, out of which, drivers
> > +Current implementation defines four recovery methods, out of which, drivers
> > can use any one, multiple or none. Method(s) of choice will be sent in the
> > uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > -more side-effects. If driver is unsure about recovery or method is unknown
> > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > -will be sent instead.
> > +more side-effects. If the recovery method is specific to the vendor,
> > +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > +specific documentation for the recovery procedure. As an example, if the driver
> > +is 'Xe', then the 'Device Wedging' documentation of the Xe driver needs to be
> > +consulted for the recovery procedure. If the driver is unsure about recovery or
> > +the method is unknown, ``WEDGED=unknown`` will be sent instead.
>
> What if instead of this we do something like:
>
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -441,6 +441,29 @@ following expectations.
> unknown consumer policy
> =============== ========================================
>
> +Vendor-Specific Recovery
> +++++++++++++++++++++++++
> +
> +When ``WEDGED=vendor-specific`` is emitted, it indicates that the device requires a
> +recovery method that is *not standardized* and is specific to the hardware vendor.
> +
> +In this case, the vendor driver must provide detailed documentation describing
> +every possible recovery procedure and its steps. It needs to include:
> +
> +- Hints: which of the following can be used to identify the
> + specific device and guide the administrator:
> + + Sysfs, debugfs, tracepoints, or kernel logs (e.g., ``dmesg``)
> +- Explicit guidance: any admin or userspace tools and scripts necessary
> + to carry out recovery.
> +
> +**Example**:
> + If the device uses the ``Xe`` driver, then administrators should consult the
> + *"Device Wedging"* section of the Xe driver's documentation to determine
> + the proper steps for recovery.
> +
> +Notes
> ++++++
> +
> The only exception to this is ``WEDGED=none``, which signifies that the device
>
> ----------------------
>
> Maxime, is it any better?
Yes, it is. Thanks!
Maxime
More information about the Intel-xe mailing list