[PATCH v5 1/1] drm/doc: Document DRM device reset expectations
Marek Olšák
maraeo at gmail.com
Tue Jun 27 18:57:25 UTC 2023
On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid at igalia.com> wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <pekka.paalanen at collabora.com>
> Signed-off-by: André Almeida <andrealmeid at igalia.com>
> ---
>
> v4:
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>
> Changes:
> - Grammar fixes (Randy)
>
> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> 1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst
> b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third
> handler for
> mmapped regular files. Threads cause additional pain with signal
> handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware
> bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again.
> This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to
> perform
> +it as needed. Usually a hang is detected when a job gets stuck executing.
> KMD
> +should keep track of resets, because userspace can query any time about
> the
> +reset stats for an specific context. This is needed to propagate to the
> rest of
> +the stack that a reset has happened. Currently, this is implemented by
> each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the
> device has
> +been reset, and this can be checked more often if the UMD requires it.
> After
> +detecting a reset, UMD will then proceed to report it to the application
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
>
The UMD won't check the device status before every command submission due
to ioctl overhead. Instead, the KMD should skip command submission and
return an error that it was skipped.
The only case where that won't be applicable is user queues where drivers
don't call into the kernel to submit work, but they do call into the kernel
to create a dma_fence. In that case, the call to create a dma_fence can
fail with an error.
Marek
+
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is
> using.
> +
> +Graphical APIs provide ways to applications to deal with device resets.
> However,
> +there is no guarantee that the app will use such features correctly, and
> the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> +the user interface from being correctly displayed. This should be done
> even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES).
> This
> +interface tells if a reset has happened, and if so, all the context state
> is
> +considered lost and the app proceeds by creating new ones. If it is
> possible to
> +determine that robustness is not in use, the UMD will terminate the app
> when a
> +reset is detected, giving that the contexts are lost and the app won't be
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for
> submissions.
> +This error code means, among other things, that a device reset has
> happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover,
> it's
> +really useful for driver developers to learn more about what caused the
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
> .. _drm_driver_ioctl:
>
> IOCTL Support on Device Nodes
> --
> 2.41.0
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20230627/6e7b1a50/attachment.htm>
More information about the amd-gfx
mailing list