[PATCH v2 1/1] drm/doc: Document DRM device reset expectations

Rob Clark robdclark at gmail.com
Tue Feb 28 17:20:55 UTC 2023


On Mon, Feb 27, 2023 at 12:40 PM André Almeida <andrealmeid at igalia.com> wrote:
>
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid at igalia.com>
> ---
>  Documentation/gpu/drm-uapi.rst | 51 ++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3d6c3ed392ea 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,57 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in the many layers in between. To recover
> +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
> +of GPU resets can lead to an unstable userspace. This section describes what's
> +the expected behaviour from DRM drivers to do in those situations, from usermode
> +drivers and compositors as well. The end goal is to have a seamless experience
> +as possible, either the stack being able to recover itself or resetting to a new
> +stable state.
> +
> +Robustness
> +----------
> +
> +First of all, application robust APIs, when available, should be used. This
> +allows the application to correctly recover and continue to run after a reset.
> +Apps that doesn't use this should be promptly killed when the kernel driver
> +detects that it's in broken state. Specifically guidelines for some APIs:
> +
> +- OpenGL: KMD signals the abortion of submitted commands and the UMD should then
> +  react accordingly and abort the application.

I disagree.. what would be the point of GL_EXT_robustness
glGetGraphicsResetStatusEXT() if we are going to abort the application
before it has a chance to call this?

Also, this would break the deqp-egl robustness tests because they
would start crashing ;-)

> +
> +- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``.
> +  If it doesn't do it right, it's considered a broken application and UMD will
> +  deal with it, aborting it.
> +
> +Kernel mode driver
> +------------------
> +
> +The KMD must be able to detect that something is wrong with the application
> +and that a reset is needed to take place to recover the device (e.g. an endless
> +wait). It needs to properly track the context that is broken and mark it as
> +dead, so any other syscalls to that context should be further rejected. The
> +other contexts should be preserved when possible, avoid crashing the rest of
> +userspace. KMD can ban a file descriptor that keeps causing resets, as it's
> +likely in a broken loop.

syscalls to the context?  Like the one querying the reset status?  :-P

In general I don't think the KMD should block syscalls.  _Maybe_ there
could be some threshold at which point we start blocking things, but I
think that would still cause problems with deqp-egl.

What we should perhaps do is encourage drivers to implement
devcoredump support for logging/reporting GPU crashes.  This would
have the benefit that distro error reporting could be standardized.
And hopefully some actionable bug reports come out of it.

And maybe we could standardize UABI for reporting crashes so a
compositor has a chance to realize an app is crashing and take action.
(But again, how does the compositor know that this isn't intentional,
it would be kinda inconvenient if the compositor kept killing my deqp
runs.)  But for all the rest, nak

BR,
-R


> +
> +User mode driver
> +----------------
> +
> +During a reset, UMD should be aware that rejected syscalls indicates that the
> +context is broken and for robust apps the recovery should happen for the
> +context. Non-robust apps must be terminated.
> +
> +Compositors
> +-----------
> +
> +Compositors should be robust as well to properly deal with its errors.
> +
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.39.2
>


More information about the amd-gfx mailing list