[PATCH v5 1/1] drm/doc: Document DRM device reset expectations

Wed Jul 5 06:30:49 UTC 2023

On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer at mailbox.org> wrote:

> On 7/4/23 04:34, Marek Olšák wrote:
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer at mailbox.org> wrote:
> >     On 6/30/23 22:32, Marek Olšák wrote:
> >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer at mailbox.org> wrote:
> >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick <sebastian.wick at redhat.com> wrote:
> >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid at igalia.com> wrote:
> >     >>>>>
> >     >>>>> +Robustness
> >     >>>>> +----------
> >     >>>>> +
> >     >>>>> +The only way to try to keep an application working after a reset is if it
> >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >     >>>>> +
> >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >     >>>>> +there is no guarantee that the app will use such features correctly, and the
> >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >     >>>>> +the user interface from being correctly displayed. This should be done even if
> >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >     >>>>
> >     >>>> I still don't think it's good to let the kernel arbitrarily kill
> >     >>>> processes that it thinks are not well-behaved based on some heuristics
> >     >>>> and policy.
> >     >>>>
> >     >>>> Can't this be outsourced to user space? Expose the information about
> >     >>>> processes causing a device and let e.g. systemd deal with coming up
> >     >>>> with a policy and with killing stuff.
> >     >>>
> >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >     >>> just call exit().
> >     >>
> >     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> >     >
> >     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> >
> >     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> >
> > which is currently the majority.
>
> Not sure where you're going with this. Applications need to use robustness
> to be able to recover from a GPU hang, and the GPU needs to be reset for
> that. So disabling GPU reset is not the same as what we're discussing here.
>
>
> >     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> >
> >     That's not the UMD's call to make.
> >
> > That's absolutely the UMD's call to make because that's mandated by the hw and API design
>
> Can you point us to a spec which mandates that the process must be killed
> in this case?
>
>
> > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
>
> The UMD killing the process is not the only way out of that, and doing so
> is overreach on its part. The UMD is but one out of many components in a
> process, not the main one or a special one. It doesn't get to decide when
> the process must die, certainly not under circumstances where it must be
> able to continue while ignoring API calls (that's required for robustness).
>

You're mixing things up. Robust apps don't need any special action from a
UMD. Only non-robust apps need to be killed for proper recovery. The only
other alternative is to stop updating the window/screen, which is not
user-friendly because a user who has never heard of GPU hangs has no
fucking idea why it's frozen or what to do about it. It doesn't matter that
you can debug it, because you're not the average user. Also, this is
already used and required by our customers on Android, because killing a
process returns the user to the desktop screen and can generate a crash
dump instead of keeping the app output frozen, and they agree that this is
the best user experience given the circumstances.

Also if the ML ignores html, that's fine.

Marek

>
> >     >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>