[PATCH v9 2/4] drm/doc: Document device wedged event
Christian König
christian.koenig at amd.com
Fri Nov 15 09:19:42 UTC 2024
On 15.11.24 at 06:07, Raag Jadav wrote:
> Add documentation for device wedged event in a new 'Device wedging'
> chapter. It describes basic definitions and consumer expectations
> along with an example.
>
> v8: Improve documentation (Christian, Rodrigo)
> v9: Add prerequisites section (Christian)
>
> Signed-off-by: Raag Jadav <raag.jadav at intel.com>
Sounds totally sane to me, but I'm not a native speaker of English so
others should probably look at it as well.
Anyway feel free to add Reviewed-by: Christian König
<christian.koenig at amd.com>.
Regards,
Christian.
> ---
> Documentation/gpu/drm-uapi.rst | 102 ++++++++++++++++++++++++++++++++-
> 1 file changed, 99 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index b75cc9a70d1f..33d9c253d4d6 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -371,9 +371,105 @@ Reporting causes of resets
>
> Apart from propagating the reset through the stack so apps can recover, it's
> really useful for driver developers to learn more about what caused the reset in
> -the first place. DRM devices should make use of devcoredump to store relevant
> -information about the reset, so this information can be added to user bug
> -reports.
> +the first place. For this, drivers can make use of devcoredump to store relevant
> +information about the reset and send a device wedged event without a recovery
> +method (as explained in the next chapter) to notify userspace, so this
> +information can be collected and added to user bug reports.
> +
> +Device wedging
> +==============
> +
> +Drivers can optionally make use of the device wedged event (implemented as
> +drm_dev_wedged_event() in the DRM subsystem), which notifies userspace of the
> +'wedged' (hung/unusable) state of the DRM device through a uevent. This is
> +especially useful in cases where the device is no longer operating as expected
> +and has become unrecoverable from driver context. The purpose of this
> +implementation is to give drivers a generic way to recover with the help of
> +userspace intervention without taking any drastic measures in the driver.
> +
> +A 'wedged' device is basically a dead device that needs attention. The
> +uevent is the notification that is sent to userspace along with a hint about
> +what could possibly be attempted to recover the device and bring it back to
> +a usable state. Different drivers may have different ideas of a 'wedged' device
> +depending on their hardware implementation, hence the vendor-agnostic nature
> +of the event. It is up to the drivers to decide when they see the need for
> +recovery and which of the available methods they want to recover with.
> +
> +Prerequisites
> +-------------
> +
> +The driver, before opting for recovery, needs to make sure that the 'wedged'
> +device doesn't harm the system as a whole by taking care of the prerequisites.
> +Necessary actions must include disabling DMA to system memory as well as any
> +communication channels with other devices. Further, the driver must ensure
> +that all dma_fences are signalled and any device state that the core kernel
> +might depend on is cleaned up. Once the event is sent, the device must be
> +kept in 'wedged' state until the recovery is performed. New accesses to the
> +device (IOCTLs) should be blocked, preferably with an error code that
> +reflects the type of failure the device has encountered. This will signify
> +the reason for wedging, which can be reported to the application if needed.
> +
> +Recovery
> +--------
> +
> +The current implementation defines three recovery methods, of which drivers
> +can use any one, multiple or none. The method(s) of choice will be sent in the
> +uevent environment as ``WEDGED=<method1>[,<method2>]`` in order of less to
> +more side-effects. If the driver is unsure about recovery or the method is
> +unknown (like soft/hard reboot, firmware flashing, hardware replacement or any
> +other procedure which can't be attempted on the fly), ``WEDGED=unknown`` will
> +be sent instead.
> +
> +Userspace consumers can parse this event and attempt recovery as per the
> +following expectations.
> +
> + =============== ================================
> + Recovery method Consumer expectations
> + =============== ================================
> + none optional telemetry collection
> + rebind unbind + bind driver
> + bus-reset unbind + reset bus device + bind
> + unknown admin/user policy
> + =============== ================================
> +
> +The only exception to this is ``WEDGED=none``, which signifies that the
> +device was temporarily 'wedged' at some point but was able to recover using
> +device specific methods like reset. No explicit action is expected from
> +userspace consumers in this case, but they can still take additional steps
> +like gathering telemetry information (devcoredump, syslog). This is useful
> +because the first hang is usually the most critical one, which can result in
> +subsequent hangs or complete wedging.
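
As a rough illustration of the consumer side described above, a script
could split the comma-separated ``WEDGED`` value and take the first method
it knows how to handle. Since methods are listed from fewer to more
side-effects, the first supported entry is the least intrusive one. This is
only a sketch: the function name pick_wedged_method and its second argument
(a space-separated list of methods the consumer supports) are hypothetical
and not defined by the patch.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch.
# $1: value of WEDGED from the uevent, e.g. "rebind,bus-reset"
# $2: space-separated list of methods this consumer can perform
# Prints the chosen method; prints "unknown" when nothing matches.
pick_wedged_method() {
    wedged=$1
    supported=$2

    # "none" and "unknown" are passed through for policy code to handle.
    case $wedged in
        none|unknown) printf '%s\n' "$wedged"; return 0 ;;
        "") printf 'unknown\n'; return 1 ;;
    esac

    # Split on commas; the event lists methods from fewer to more
    # side-effects, so the first supported entry wins.
    old_ifs=$IFS
    IFS=,
    set -- $wedged
    IFS=$old_ifs

    for method in "$@"; do
        for candidate in $supported; do
            if [ "$method" = "$candidate" ]; then
                printf '%s\n' "$method"
                return 0
            fi
        done
    done

    printf 'unknown\n'
    return 1
}
```

A udev ``RUN`` helper could call this with ``$env{WEDGED}`` and dispatch on
the printed value.
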
> +
> +Example
> +-------
> +
> +Udev rule::
> +
> + SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", \
> + RUN+="/path/to/rebind.sh $env{DEVPATH}"
> +
> +Recovery script::
> +
> + #!/bin/sh
> +
> + DEVPATH=$(readlink -f "/sys/$1/device")
> + DEVICE=$(basename "$DEVPATH")
> + DRIVER=$(readlink -f "$DEVPATH/driver")
> +
> + echo -n "$DEVICE" > "$DRIVER/unbind"
> + sleep 1
> + echo -n "$DEVICE" > "$DRIVER/bind"
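
The 'bus-reset' row of the table (unbind + reset bus device + bind) could be
handled along the same lines. The sketch below assumes the device sits on a
bus that exposes a ``reset`` attribute in sysfs, as PCI devices do; other
buses may need a different mechanism. The function name wedged_bus_reset and
the optional sysfs-root argument are hypothetical and only there to make the
sketch easy to try against a fake tree.

```shell
#!/bin/sh
# Hypothetical sketch, not part of the patch.
# $1: DEVPATH of the DRM card from the uevent
# $2: sysfs root (optional, defaults to /sys)
wedged_bus_reset() {
    sysfs=${2:-/sys}

    # Resolve the parent bus device and its driver before unbinding,
    # because the 'driver' link disappears once the driver is unbound.
    dev=$(readlink -f "$sysfs/$1/device") || return 1
    name=$(basename "$dev")
    drv=$(readlink -f "$dev/driver") || return 1

    # Assumption: the bus provides a 'reset' attribute (PCI does).
    [ -e "$dev/reset" ] || return 1

    printf '%s' "$name" > "$drv/unbind" || return 1
    printf '1' > "$dev/reset" || return 1
    printf '%s' "$name" > "$drv/bind"
}
```

If the reset step fails, this sketch returns without rebinding and leaves
the next step to admin policy.
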
> +
> +Customization
> +-------------
> +
> +Although basic recovery is possible with a simple script, admin/users can
> +define custom policies around recovery action. For example, if the driver
> +supports multiple recovery methods, consumers can opt for the suitable one
> +based on policy definition. Consumers can also choose to have the device
> +available for debugging or additional data collection before performing the
> +recovery. This is especially useful when the driver is unsure about recovery
> +or the method is unknown.
>
> .. _drm_driver_ioctl:
>