<div dir="auto"><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jun 27, 2023, 09:23 André Almeida <<a href="mailto:andrealmeid@igalia.com">andrealmeid@igalia.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Create a section that specifies how to deal with DRM device resets for<br> kernel and userspace drivers.<br> <br> Acked-by: Pekka Paalanen <<a href="mailto:pekka.paalanen@collabora.com" target="_blank" rel="noreferrer">pekka.paalanen@collabora.com</a>><br> Signed-off-by: André Almeida <<a href="mailto:andrealmeid@igalia.com" target="_blank" rel="noreferrer">andrealmeid@igalia.com</a>><br> ---<br> <br> v4: <a href="https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/" rel="noreferrer noreferrer" target="_blank">https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/</a><br> <br> Changes:<br> - Grammar fixes (Randy)<br> <br> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++<br> 1 file changed, 68 insertions(+)<br> <br> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst<br> index 65fb3036a580..3cbffa25ed93 100644<br> --- a/Documentation/gpu/drm-uapi.rst<br> +++ b/Documentation/gpu/drm-uapi.rst<br> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for<br> mmapped regular files. Threads cause additional pain with signal<br> handling as well.<br> <br> +Device reset<br> +============<br> +<br> +The GPU stack is really complex and is prone to errors, from hardware bugs,<br> +faulty applications and everything in between the many layers. Some errors<br> +require resetting the device in order to make the device usable again. 
This<br> +sections describes the expectations for DRM and usermode drivers when a<br> +device resets and how to propagate the reset status.<br> +<br> +Kernel Mode Driver<br> +------------------<br> +<br> +The KMD is responsible for checking if the device needs a reset, and to perform<br> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD<br> +should keep track of resets, because userspace can query any time about the<br> +reset stats for an specific context. This is needed to propagate to the rest of<br> +the stack that a reset has happened. Currently, this is implemented by each<br> +driver separately, with no common DRM interface.<br> +<br> +User Mode Driver<br> +----------------<br> +<br> +The UMD should check before submitting new commands to the KMD if the device has<br> +been reset, and this can be checked more often if the UMD requires it. After<br> +detecting a reset, UMD will then proceed to report it to the application using<br> +the appropriate API error code, as explained in the section below about<br> +robustness.<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">The UMD won't check the device status before every command submission because of the ioctl overhead. Instead, the KMD should skip the submission and return an error indicating that it was skipped.</div><div dir="auto"><br></div><div dir="auto">The only case where that doesn't apply is user queues, where drivers don't call into the kernel to submit work but do call into the kernel to create a dma_fence.
In that case, the call to create a dma_fence can fail with an error.</div><div dir="auto"><br></div><div dir="auto">Marek</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> +<br> +Robustness<br> +----------<br> +<br> +The only way to try to keep an application working after a reset is if it<br> +complies with the robustness aspects of the graphical API that it is using.<br> +<br> +Graphical APIs provide ways to applications to deal with device resets. However,<br> +there is no guarantee that the app will use such features correctly, and the<br> +UMD can implement policies to close the app if it is a repeating offender,<br> +likely in a broken loop. This is done to ensure that it does not keep blocking<br> +the user interface from being correctly displayed. This should be done even if<br> +the app is correct but happens to trigger some bug in the hardware/driver.<br> +<br> +OpenGL<br> +~~~~~~<br> +<br> +Apps using OpenGL should use the available robust interfaces, like the<br> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This<br> +interface tells if a reset has happened, and if so, all the context state is<br> +considered lost and the app proceeds by creating new ones. 
If it is possible to<br> +determine that robustness is not in use, the UMD will terminate the app when a<br> +reset is detected, giving that the contexts are lost and the app won't be able<br> +to figure this out and recreate the contexts.<br> +<br> +Vulkan<br> +~~~~~~<br> +<br> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.<br> +This error code means, among other things, that a device reset has happened and<br> +it needs to recreate the contexts to keep going.<br> +<br> +Reporting causes of resets<br> +--------------------------<br> +<br> +Apart from propagating the reset through the stack so apps can recover, it's<br> +really useful for driver developers to learn more about what caused the reset in<br> +first place. DRM devices should make use of devcoredump to store relevant<br> +information about the reset, so this information can be added to user bug<br> +reports.<br> +<br> .. _drm_driver_ioctl:<br> <br> IOCTL Support on Device Nodes<br> -- <br> 2.41.0<br> <br> </blockquote></div></div></div>