[PATCH v8 02/10] drm: Add a vendor-specific recovery method to drm device wedged uevent
Rodrigo Vivi
rodrigo.vivi at intel.com
Thu Aug 21 15:17:13 UTC 2025
On Thu, Aug 21, 2025 at 08:01:39PM +0530, Riana Tauro wrote:
> Hi Maxime
>
> This patch has the changes suggested wrt to documentation in v7. Have added
> whatever Rodrigo suggested in the doc. Please let us know if the changes are
> okay and if the patch can be merged.
Maxime already said he is okay with the changes:
https://lore.kernel.org/dri-devel/5hkngbuzoryldvjrtjwalxhosdhtweeinpjpyguzltjmee7mpu@vw44iwczytw5/
>
> This needs a drm-misc maintainer ack to go ahead.
But we do still need a formal ack here to move ahead with this patch.
>
> Thanks
> Riana
>
> On 8/17/2025 9:55 PM, Rodrigo Vivi wrote:
> > On Thu, Aug 14, 2025 at 05:44:32PM +0530, Riana Tauro wrote:
> > > Address the need for a recovery method (firmware flash on Firmware errors)
> > > introduced in the later patches of Xe KMD.
> > > Whenever XE KMD detects a firmware error, a firmware flash is required to
> > > recover the device to normal operation.
> > >
> > > The initial proposal to use 'firmware-flash' as a recovery method was
> > > not applicable to other drivers and could cause multiple recovery
> > > methods specific to vendors to be added.
> > > To address this a more generic 'vendor-specific' method is introduced,
> > > guiding users to refer to vendor specific documentation and system logs
> > > for detailed vendor specific recovery procedure.
> > >
> > > Add a recovery method 'WEDGED=vendor-specific' for such errors.
> > > Vendors must provide additional recovery documentation if this method
> > > is used.
> > >
> > > It is the responsibility of the consumer to refer to the correct vendor
> > > specific documentation and usecase before attempting a recovery.
> > >
> > > For example: If driver is XE KMD, the consumer must refer
> > > to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
> > >
> > > v2: fix documentation (Raag)
> > > v3: add more details to commit message (Sima, Rodrigo, Raag)
> > > add an example script to the documentation (Raag)
> > > v4: use consistent naming (Raag)
> > > v5: fix commit message
> > > v6: add more documentation
> > >
> > > Cc: André Almeida <andrealmeid at igalia.com>
> > > Cc: Christian König <christian.koenig at amd.com>
> > > Cc: David Airlie <airlied at gmail.com>
> > > Cc: Simona Vetter <simona.vetter at ffwll.ch>
> >
> > Cc: Maxime Ripard <mripard at kernel.org>
> >
> > Folks, is it clear now? can we move ahead and get this through drm-xe-next?
> >
> > > Signed-off-by: Raag Jadav <raag.jadav at intel.com>
> > > Signed-off-by: Riana Tauro <riana.tauro at intel.com>
> > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > > ---
> > > Documentation/gpu/drm-uapi.rst | 47 +++++++++++++++++++++++++++++-----
> > > drivers/gpu/drm/drm_drv.c | 2 ++
> > > include/drm/drm_device.h | 4 +++
> > > 3 files changed, 46 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 843facf01b2d..669a6b9da0b2 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -418,13 +418,12 @@ needed.
> > > Recovery
> > > --------
> > > -Current implementation defines three recovery methods, out of which, drivers
> > > +Current implementation defines four recovery methods, out of which, drivers
> > > can use any one, multiple or none. Method(s) of choice will be sent in the
> > > uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > -more side-effects. If driver is unsure about recovery or method is unknown
> > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > -will be sent instead.
> > > +more side-effects. See the section `Vendor Specific Recovery`_
> > > +for ``WEDGED=vendor-specific``. If driver is unsure about recovery or
> > > +method is unknown, ``WEDGED=unknown`` will be sent instead.
> > > Userspace consumers can parse this event and attempt recovery as per the
> > > following expectations.
> > > @@ -435,6 +434,7 @@ following expectations.
> > > none optional telemetry collection
> > > rebind unbind + bind driver
> > > bus-reset unbind + bus reset/re-enumeration + bind
> > > + vendor-specific vendor specific recovery method
> > > unknown consumer policy
> > > =============== ========================================
> > > @@ -446,6 +446,35 @@ telemetry information (devcoredump, syslog). This is useful because the first
> > > hang is usually the most critical one which can result in consequential hangs or
> > > complete wedging.
> > > +
> > > +Vendor Specific Recovery
> > > +------------------------
> > > +
> > > +When ``WEDGED=vendor-specific`` is sent, it indicates that the device requires
> > > +a recovery procedure specific to the hardware vendor and is not one of the
> > > +standardized approaches.
> > > +
> > > +``WEDGED=vendor-specific`` may be used to indicate different cases within a
> > > +single vendor driver, each requiring a distinct recovery procedure.
> > > +In such scenarios, the vendor driver must provide comprehensive documentation
> > > +that describes each case, includes additional hints to identify specific case and
> > > +outlines the corresponding recovery procedures. The documentation includes:
> > > +
> > > +Case - A list of all cases that sends the ``WEDGED=vendor-specific`` recovery method.
> > > +
> > > +Hints - Additional Information to assist the userspace consumer in identifying and
> > > +differentiating between different cases. This can be exposed through sysfs, debugfs,
> > > +traces, dmesg etc.
> > > +
> > > +Recovery Procedure - Clear instructions and guidance for recovering each case.
> > > +This may include userspace scripts, tools needed for the recovery procedure.
> > > +
> > > +It is the responsibility of the admin/userspace consumer to identify the case and
> > > +verify additional identification hints before attempting a recovery procedure.
> > > +
> > > +Example: If the device uses the Xe driver, then userspace consumer should refer to
> > > +:ref:`Xe Device Wedging <xe-device-wedging>` for the detailed documentation.
> > > +
> > > Task information
> > > ----------------
> > > @@ -472,8 +501,12 @@ erroring out, all device memory should be unmapped and file descriptors should
> > > be closed to prevent leaks or undefined behaviour. The idea here is to clear the
> > > device of all user context beforehand and set the stage for a clean recovery.
> > > -Example
> > > --------
> > > +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
> > > +consumer to check the driver documentation and the usecase before attempting
> > > +a recovery.
> > > +
> > > +Example - rebind
> > > +----------------
> > > Udev rule::
> > > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > > index cdd591b11488..0ac723a46a91 100644
> > > --- a/drivers/gpu/drm/drm_drv.c
> > > +++ b/drivers/gpu/drm/drm_drv.c
> > > @@ -532,6 +532,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> > > return "rebind";
> > > case DRM_WEDGE_RECOVERY_BUS_RESET:
> > > return "bus-reset";
> > > + case DRM_WEDGE_RECOVERY_VENDOR:
> > > + return "vendor-specific";
> > > default:
> > > return NULL;
> > > }
> > > diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> > > index a33aedd5e9ec..59fd3f4d5995 100644
> > > --- a/include/drm/drm_device.h
> > > +++ b/include/drm/drm_device.h
> > > @@ -26,10 +26,14 @@ struct pci_controller;
> > > * Recovery methods for wedged device in order of less to more side-effects.
> > > * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> > > * use any one, multiple (or'd) or none depending on their needs.
> > > + *
> > > + * Refer to "Device Wedging" chapter in Documentation/gpu/drm-uapi.rst for more
> > > + * details.
> > > */
> > > #define DRM_WEDGE_RECOVERY_NONE BIT(0) /* optional telemetry collection */
> > > #define DRM_WEDGE_RECOVERY_REBIND BIT(1) /* unbind + bind driver */
> > > #define DRM_WEDGE_RECOVERY_BUS_RESET BIT(2) /* unbind + reset bus device + bind */
> > > +#define DRM_WEDGE_RECOVERY_VENDOR BIT(3) /* vendor specific recovery method */
> > > /**
> > > * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> > > --
> > > 2.47.1
> > >
>
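
For anyone following along on the consumer side, here is a minimal sketch of how a userspace tool might parse the ``WEDGED=<method1>[,..,<methodN>]`` value described in the documentation above. It is only an illustration: the enum, the flag values and both function names (wedge_method_from_name, parse_wedged) are hypothetical and not part of this patch or of any kernel uAPI. Only the method strings (none, rebind, bus-reset, vendor-specific, unknown) come from the patch. Treating an unrecognized token as "unknown" is an assumption about consumer policy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical consumer-side flags; illustrative only, not kernel uAPI. */
enum wedge_method {
	WEDGE_NONE      = 1u << 0,	/* optional telemetry collection */
	WEDGE_REBIND    = 1u << 1,	/* unbind + bind driver */
	WEDGE_BUS_RESET = 1u << 2,	/* unbind + bus reset + bind */
	WEDGE_VENDOR    = 1u << 3,	/* see vendor documentation */
	WEDGE_UNKNOWN   = 1u << 4,	/* consumer policy */
};

/* Map one method token (not NUL-terminated, length len) to a flag, 0 if unrecognized. */
static unsigned int wedge_method_from_name(const char *name, size_t len)
{
	static const struct {
		const char *name;
		unsigned int flag;
	} table[] = {
		{ "none",            WEDGE_NONE },
		{ "rebind",          WEDGE_REBIND },
		{ "bus-reset",       WEDGE_BUS_RESET },
		{ "vendor-specific", WEDGE_VENDOR },
		{ "unknown",         WEDGE_UNKNOWN },
	};
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strlen(table[i].name) == len &&
		    strncmp(name, table[i].name, len) == 0)
			return table[i].flag;
	}
	return 0;
}

/*
 * Parse the value part of WEDGED=... (a comma-separated method list) into a
 * bitmask. Tokens this sketch does not recognize are folded into
 * WEDGE_UNKNOWN (assumed policy). An empty string yields 0.
 */
static unsigned int parse_wedged(const char *value)
{
	unsigned int mask = 0;

	while (*value) {
		size_t len = strcspn(value, ",");
		unsigned int flag = wedge_method_from_name(value, len);

		mask |= flag ? flag : WEDGE_UNKNOWN;
		value += len;
		if (*value == ',')
			value++;
	}
	return mask;
}
```

A consumer that sees WEDGE_VENDOR set in the result would then stop and consult the driver's documentation and hints (sysfs, debugfs, dmesg and so on) rather than attempting a generic rebind or bus reset.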