Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)

Wed Aug 2 07:38:36 UTC 2023

A screen that doesn't update isn't usable. Killing the window system
and returning to the login screen is one option. Killing the window
system manually from a terminal or over ssh and then returning to the
login screen is another option, but 99% of users either hard-reset the
machine or do sysrq+REISUB anyway because it's faster that way. Those
are all your options. If we don't do the kill, users might decide to
do a hard reset with an unsync'd file system, which can cause more
damage.

The precedent from the CPU land is pretty strong here. There is
SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
instructions, yet we do nothing for invalid GPU memory access and
invalid GPU instructions. Sending a terminating signal from the kernel
would be the most natural thing to do. Instead, we just keep a frozen
GUI to keep users helpless, or we continue command submission and then
the hanging app can cause an infinite cycle of GPU hangs and resets,
making the GPU unusable until somebody kills the app over ssh.

That's why GL/Vulkan robustness is required - either robust apps, or a
robust compositor that greys out lost windows and pops up a diagnostic
message with a list of actions to choose from. That's the direction we
should be taking. Non-robust apps under a non-robust compositor should
just be killed if they crash the GPU.

Marek

On Wed, Jul 26, 2023 at 4:07 AM Michel Dänzer
<michel.daenzer at mailbox.org> wrote:
>
> On 7/25/23 15:02, André Almeida wrote:
> > Em 25/07/2023 05:03, Michel Dänzer escreveu:
> >> On 7/25/23 04:55, André Almeida wrote:
> >>> Hi everyone,
> >>>
> >>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
> >>>
> >>> Em 27/06/2023 10:23, André Almeida escreveu:
> >>>> +Robustness
> >>>> +----------
> >>>> +
> >>>> +The only way to try to keep an application working after a reset is if it
> >>>> +complies with the robustness aspects of the graphical API that it is using.
> >>>> +
> >>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >>>> +there is no guarantee that the app will use such features correctly, and the
> >>>> +UMD can implement policies to close the app if it is a repeating offender,
> >>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >>>> +the user interface from being correctly displayed. This should be done even if
> >>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >>>> +
> >>> Depending on the OpenGL version, there are different robustness API available:
> >>>
> >>> - OpenGL ABR extension [0]
> >>> - OpenGL KHR extension [1]
> >>> - OpenGL ES extension  [2]
> >>>
> >>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
> >>>
> >>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
> >>>
> >>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
> >>>
> >>> Given that, I believe drivers have the following options:
> >>>
> >>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
> >>>
> >>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
> >>>
> >>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
> >>>
> >>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
> >>>
> >>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
> >>
> >> I disagree on these conclusions.
> >>
> >> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> >>
> >
> > Thank you for the feedback. How do you think the documentation should look like for this part?
>
> The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether.
>
>
> If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>