[PATCH v7 1/9] drm: Add a vendor-specific recovery method to drm device wedged uevent
Maxime Ripard
mripard at kernel.org
Thu Jul 31 13:01:18 UTC 2025
On Thu, Jul 31, 2025 at 04:43:46PM +0530, Riana Tauro wrote:
> Hi Maxim
>
> On 7/31/2025 3:02 PM, Maxime Ripard wrote:
> > Hi,
> >
> > On Wed, Jul 30, 2025 at 07:33:01PM +0530, Riana Tauro wrote:
> > > On 7/28/2025 3:57 PM, Riana Tauro wrote:
> > > > Address the need for a recovery method (firmware flash on Firmware errors)
> > > > introduced in the later patches of Xe KMD.
> > > > Whenever XE KMD detects a firmware error, a firmware flash is required to
> > > > recover the device to normal operation.
> > > >
> > > > The initial proposal to use 'firmware-flash' as a recovery method was
> > > > not applicable to other drivers and could cause multiple recovery
> > > > methods specific to vendors to be added.
> > > > To address this a more generic 'vendor-specific' method is introduced,
> > > > guiding users to refer to vendor specific documentation and system logs
> > > > for detailed vendor specific recovery procedure.
> > > >
> > > > Add a recovery method 'WEDGED=vendor-specific' for such errors.
> > > > Vendors must provide additional recovery documentation if this method
> > > > is used.
> > > >
> > > > It is the responsibility of the consumer to refer to the correct vendor
> > > > specific documentation and usecase before attempting a recovery.
> > > >
> > > > For example: If driver is XE KMD, the consumer must refer
> > > > to the documentation of 'Device Wedging' under 'Documentation/gpu/xe/'.
> > > >
> > > > Recovery script contributed by Raag.
> > > >
> > > > v2: fix documentation (Raag)
> > > > v3: add more details to commit message (Sima, Rodrigo, Raag)
> > > > add an example script to the documentation (Raag)
> > > > v4: use consistent naming (Raag)
> > > > v5: fix commit message
> > > >
> > > > Cc: André Almeida <andrealmeid at igalia.com>
> > > > Cc: Christian König <christian.koenig at amd.com>
> > > > Cc: David Airlie <airlied at gmail.com>
> > > > Cc: Simona Vetter <simona.vetter at ffwll.ch>
> > > > Co-developed-by: Raag Jadav <raag.jadav at intel.com>
> > > > Signed-off-by: Raag Jadav <raag.jadav at intel.com>
> > > > Signed-off-by: Riana Tauro <riana.tauro at intel.com>
> > > > Reviewed-by: Rodrigo Vivi <rodrigo.vivi at intel.com>
> > >
> > > This patch needs an ack from drm to be merged.
> > > The rest of the series have RB's. Can someone please provide an ack ?
> > >
> > > Cc: drm-misc maintainers
> > >
> > > Thanks
> > > Riana
> > >
> > > > ---
> > > > Documentation/gpu/drm-uapi.rst | 42 ++++++++++++++++++++++++++++------
> > > > drivers/gpu/drm/drm_drv.c | 2 ++
> > > > include/drm/drm_device.h | 4 ++++
> > > > 3 files changed, 41 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > > index 843facf01b2d..5691b29acde3 100644
> > > > --- a/Documentation/gpu/drm-uapi.rst
> > > > +++ b/Documentation/gpu/drm-uapi.rst
> > > > @@ -418,13 +418,15 @@ needed.
> > > > Recovery
> > > > --------
> > > > -Current implementation defines three recovery methods, out of which, drivers
> > > > +Current implementation defines four recovery methods, out of which, drivers
> > > > can use any one, multiple or none. Method(s) of choice will be sent in the
> > > > uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> > > > -more side-effects. If driver is unsure about recovery or method is unknown
> > > > -(like soft/hard system reboot, firmware flashing, physical device replacement
> > > > -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> > > > -will be sent instead.
> > > > +more side-effects. If recovery method is specific to vendor
> > > > +``WEDGED=vendor-specific`` will be sent and userspace should refer to vendor
> > > > +specific documentation for the recovery procedure. As an example if the driver
> > > > +is 'Xe' then the documentation for 'Device Wedging' of Xe driver needs to be
> > > > +referred for the recovery procedure. If driver is unsure about recovery or
> > > > +method is unknown, ``WEDGED=unknown`` will be sent instead.
> > > > Userspace consumers can parse this event and attempt recovery as per the
> > > > following expectations.
> > > > @@ -435,6 +437,7 @@ following expectations.
> > > > none optional telemetry collection
> > > > rebind unbind + bind driver
> > > > bus-reset unbind + bus reset/re-enumeration + bind
> > > > + vendor-specific vendor specific recovery method
> > > > unknown consumer policy
> > > > =============== ========================================
> > > > @@ -472,8 +475,12 @@ erroring out, all device memory should be unmapped and file descriptors should
> > > > be closed to prevent leaks or undefined behaviour. The idea here is to clear the
> > > > device of all user context beforehand and set the stage for a clean recovery.
> > > > -Example
> > > > --------
> > > > +For ``WEDGED=vendor-specific`` recovery method, it is the responsibility of the
> > > > +consumer to check the driver documentation and the usecase before attempting
> > > > +a recovery.
> > > > +
> > > > +Example - rebind
> > > > +----------------
> > > > Udev rule::
> > > > @@ -491,6 +498,27 @@ Recovery script::
> > > > echo -n $DEVICE > $DRIVER/unbind
> > > > echo -n $DEVICE > $DRIVER/bind
> > > > +Example - vendor-specific
> > > > +-------------------------
> > > > +
> > > > +Udev rule::
> > > > +
> > > > + SUBSYSTEM=="drm", ENV{WEDGED}=="vendor-specific", DEVPATH=="*/drm/card[0-9]",
> > > > + RUN+="/path/to/vendor_specific_recovery.sh $env{DEVPATH}"
> > > > +
> > > > +Recovery script::
> > > > +
> > > > + #!/bin/sh
> > > > +
> > > > + DEVPATH=$(readlink -f /sys/$1/device)
> > > > + DRIVERPATH=$(readlink -f $DEVPATH/driver)
> > > > + DRIVER=$(basename $DRIVERPATH)
> > > > +
> > > > + if [ "$DRIVER" = "xe" ]; then
> > > > + # Refer XE documentation and check usecase and recovery procedure
> > > > + fi
> > > > +
> > > > +
> >
> > So I guess I'm not opposed to it on principle, but the documentation
> > really needs some work.
> >
> > You should at least list the valid vendor specific options, and what
> > each mean exactly. Ideally, it should be a link to the datasheet/manual
> > detailing the recovery procedure,
>
> This is added above
>
> "If recovery method is specific to vendor ``WEDGED=vendor-specific`` will be
> sent and userspace should refer to vendor specific documentation for the
> recovery procedure. As an example if the driver is 'Xe' then the
> documentation for 'Device Wedging' of Xe driver needs to be referred for the
> recovery procedure."
>
> The documentation of Xe is in Patch 6
>
> https://lore.kernel.org/intel-xe/20250728102809.502324-7-riana.tauro@intel.com/
I'm sorry, I still don't get how, as a user, I can reimplement what that
tool is supposed to be doing. Or do you anticipate that there's only
ever be a single way to recover a Xe device, which is to reflash the
firmware?
What if in ~5y, Intel comes up with a new recovery method for the newer
models?
> I'll add the link instead of just the chapter name
> > but if that's under NDA, at least a
> > reference to the document and section you need to look at to implement
> > it properly.
> >
> > Or if that's still not doable, anything that tells you what to do
> > instead of "run a shell script we don't provide".
> >
> > Also, we just discussed it with Sima on IRC, and she mentioned that we
> > probably want to have a vendor specific prefix for each vendor-specific
> > method.
>
> This was discussed as part of Rev4
>
> https://lore.kernel.org/intel-xe/aG-U9JTXDah_tu1U@black.fi.intel.com/
>
> DEVPATH from uevent and driver should be able to identify the driver.
> Shouldn't that be enough?
See above. What happens if we start to see systems with two Xe GPUs, one
with a new recovery method and one with an old recovery method?
Maxime
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 273 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/intel-xe/attachments/20250731/3e962d5f/attachment.sig>
More information about the Intel-xe
mailing list