[PATCH v5 1/1] drm/doc: Document DRM device reset expectations
Sebastian Wick
sebastian.wick at redhat.com
Fri Jun 30 15:21:47 UTC 2023
On Fri, Jun 30, 2023 at 4:59 PM Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> <sebastian.wick at redhat.com> wrote:
> >
> > On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid at igalia.com> wrote:
> > >
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <pekka.paalanen at collabora.com>
> > > Signed-off-by: André Almeida <andrealmeid at igalia.com>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> > >
> > > Changes:
> > > - Grammar fixes (Randy)
> > >
> > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > > 1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > > mmapped regular files. Threads cause additional pain with signal
> > > handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +section describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and for
> > > +performing it as needed. Usually a hang is detected when a job gets stuck
> > > +executing. The KMD should keep track of resets, because userspace can query
> > > +the reset stats for a specific context at any time. This is needed to propagate
> > > +to the rest of the stack that a reset has happened. Currently, this is
> > > +implemented by each driver separately, with no common DRM interface.
> > > +
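
To make the per-context bookkeeping above a bit more concrete, here is a rough
userspace model in C. It is only a sketch: struct gpu_device, struct
gpu_context, enum reset_status and the helper names are made up for
illustration and are not an existing DRM interface. The guilty/innocent split
is an assumption borrowed from the robustness terminology used further down.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the bookkeeping, not a real DRM structure. */
struct gpu_device {
	uint32_t reset_count;	/* bumped on every device reset */
};

struct gpu_context {
	struct gpu_device *dev;
	uint32_t reset_count_at_create;	/* device counter seen at creation */
	bool guilty;			/* a job from this context caused a reset */
};

enum reset_status { RESET_NONE, RESET_INNOCENT, RESET_GUILTY };

static void ctx_init(struct gpu_context *ctx, struct gpu_device *dev)
{
	ctx->dev = dev;
	ctx->reset_count_at_create = dev->reset_count;
	ctx->guilty = false;
}

/* Modelled hang handler; 'offender' may be NULL if the cause is unknown. */
static void device_reset(struct gpu_device *dev, struct gpu_context *offender)
{
	dev->reset_count++;
	if (offender != NULL)
		offender->guilty = true;
}

/* What userspace would get back when asking about one context. */
static enum reset_status ctx_query_reset(const struct gpu_context *ctx)
{
	if (ctx->guilty)
		return RESET_GUILTY;
	if (ctx->dev->reset_count != ctx->reset_count_at_create)
		return RESET_INNOCENT;
	return RESET_NONE;
}
```

A context created after the reset starts from the new counter value, so in
this model it reports no reset until the next one happens.
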
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways for applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> >
> > I still don't think it's good to let the kernel arbitrarily kill
> > processes that it thinks are not well-behaved based on some heuristics
> > and policy.
> >
> > Can't this be outsourced to user space? Expose the information about
> > processes causing a device reset and let e.g. systemd deal with coming up
> > with a policy and with killing stuff.
>
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().
Ah, right, completely skipped over the UMD part. That makes more sense.
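
For illustration, the UMD-side decision could look roughly like the C sketch
below. All names (struct umd_ctx, enum umd_action, umd_before_submit) and the
MAX_GUILTY_RESETS threshold are hypothetical and not taken from any driver.
The function returns an action instead of calling exit() directly, and it
terminates non-robust apps on any detected reset, which is how I read the
OpenGL paragraph of the patch.

```c
#include <stdbool.h>

/* Arbitrary illustrative threshold for "repeating offender". */
#define MAX_GUILTY_RESETS 3

enum umd_action {
	UMD_SUBMIT,		/* no reset seen: submit as usual */
	UMD_REPORT_ERROR,	/* robust app: surface the reset via the API */
	UMD_TERMINATE		/* non-robust app or repeat offender: e.g. exit() */
};

struct umd_ctx {
	bool robust;		/* app opted in to robustness */
	unsigned guilty_resets;	/* resets attributed to this app so far */
};

/* Checked before each submission to the KMD. */
static enum umd_action umd_before_submit(struct umd_ctx *ctx,
					 bool reset_seen, bool guilty)
{
	if (!reset_seen)
		return UMD_SUBMIT;

	if (guilty)
		ctx->guilty_resets++;

	/* The app cannot notice the lost contexts, so it cannot recover. */
	if (!ctx->robust)
		return UMD_TERMINATE;

	/* Robust, but keeps triggering resets: likely a broken loop. */
	if (ctx->guilty_resets >= MAX_GUILTY_RESETS)
		return UMD_TERMINATE;

	return UMD_REPORT_ERROR;
}
```
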
>
> Alex
>
> >
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, given that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` on submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +the app needs to recreate the contexts to keep going.
> > > +
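
On the application side, the recovery described in the OpenGL and Vulkan
sections boils down to something like the sketch below. It is a model only:
enum submit_result stands in for whatever the API reports (a reset status
from the robustness extension, or VK_ERROR_DEVICE_LOST from a submission),
and struct renderer and its helpers are hypothetical names.

```c
#include <stdbool.h>

/* Stand-in for the API result of a submission. */
enum submit_result { SUBMIT_OK, SUBMIT_DEVICE_LOST };

struct renderer {
	unsigned generation;	/* bumped each time the context is recreated */
	bool resources_valid;	/* false while GPU resources are lost */
};

/*
 * A real app would destroy and recreate its context or device here and
 * re-upload every GPU resource; the model only records that it happened.
 */
static void renderer_recreate(struct renderer *r)
{
	r->generation++;
	r->resources_valid = true;
}

/*
 * Returns true if the frame went through. On false, the old context state
 * is gone and the caller should re-record and resubmit the frame.
 */
static bool renderer_handle_submit(struct renderer *r, enum submit_result res)
{
	if (res == SUBMIT_OK)
		return true;

	r->resources_valid = false;	/* all previous state is considered lost */
	renderer_recreate(r);
	return false;
}
```
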
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +the first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> > > +
> > > .. _drm_driver_ioctl:
> > >
> > > IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
>