Compositor handoffs: Switching clients between compositors

Mon Aug 16 13:13:06 UTC 2021

Hi all,

I have been working on a method of "Compositor handoffs"--allowing
clients to switch between compositors at runtime--and I wanted to
share my progress with everyone and gather some feedback.

# Overview

The design is very simple at its core. A client knows everything
about the state of its windows and still owns all appropriate memory.
In the event of a compositor exit, the client should be able to just
send everything again like it's a brand new client.

I have successfully ported Qt, SDL and Xwayland to do this, and they
have worked with practically every application using those toolkits
that I have thrown at them.

The primary goal of this is to handle a compositor crash. Whilst rare,
crashes happen to some extent on all of the desktops. With Wayland
still being such a dynamic landscape, and compositors taking on more
and more responsibilities for security reasons, this isn't something
that will end anytime soon. We want to make sure we don't have a path
where users can lose data.

Long term this could serve other future goals, like switching between
compositors across graphics cards, supporting CRIU (Checkpoint/Restore
In Userspace) for GUI applications on mobile, or dynamically
switching between the active compositor and waypipe, all of which
would be really interesting new features.

It also makes the developer experience a billion times better, as I
no longer need to log out to test changes, or test only in
contrived nested situations.

# Slightly more detail

## The compositor

The only change we needed to make in the compositor was making our
public-facing socket (e.g. /var/run/.../wayland-0) persist on the file
system. We run a small helper that creates the socket then spawns our
real compositor. If our compositor crashes, we re-use the same socket.
Clients still receive the same error about being disconnected, but
there are no races should a client try to reconnect whilst a
compositor is starting.

systemd's socket activation would also be an option here, though we
ended up writing something custom.
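
To make the helper idea concrete, here is a minimal sketch in Python of a wrapper that owns the listening socket and respawns the compositor when it crashes. It is not our actual helper: the function names and the `HELPER_LISTEN_FD` environment variable are hypothetical, and a real compositor would need to know how to adopt an inherited listening fd. It is POSIX only.

```python
import os
import socket
import subprocess


def create_listening_socket(path):
    """Bind and listen on a Unix socket that the helper keeps open."""
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket left by a previous session
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(128)
    return sock


def run_with_restarts(compositor_cmd, sock, max_restarts=3):
    """Start the compositor, respawning it on a crash with the same socket.

    Returns the number of times the compositor was started.
    """
    starts = 0
    for _ in range(1 + max_restarts):
        env = dict(os.environ)
        # Hypothetical variable name: tells the child which inherited fd
        # is the already-listening socket.
        env["HELPER_LISTEN_FD"] = str(sock.fileno())
        proc = subprocess.Popen(
            compositor_cmd, env=env, pass_fds=[sock.fileno()]
        )
        starts += 1
        if proc.wait() == 0:
            break  # clean exit: do not respawn
        # Non-zero exit is treated as a crash. The socket was never closed,
        # so clients connecting now simply queue in the listen backlog.
    return starts
```

Because the helper never closes the socket, a client that reconnects between the crash and the respawn is not refused; its connection waits until the new compositor starts accepting.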

## Toolkit changes

Whilst this looks like an invasive thing to retrofit, it
turns out to be relatively unintrusive and small. Most of the tasks we
need to do involve functions that already exist anyway.

We need to:
     - reset every wl_output (which can happen at runtime already)
     - reset input devices (which can happen at runtime already)
     - reset our data devices (which can get cancelled at runtime already)
     - reset our wl_surface/xdg_shell objects
     - send new contents (which happens already)

So for the most part we're just hooking into existing code.
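
The steps above can be sketched as a toy model in Python. Nothing here is real toolkit or libwayland API: `FakeConnection`, `Toolkit` and `Window` are hypothetical stand-ins that only show the shape of the idea, namely that the client rebuilds every protocol object from state it already holds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Window:
    title: str
    width: int
    height: int
    surface_id: Optional[int] = None  # protocol object; stale after a disconnect


class FakeConnection:
    """Stand-in for one compositor connection; records what it is sent."""

    def __init__(self):
        self.log = []
        self._next_id = 1

    def new_object(self, kind):
        obj_id = self._next_id
        self._next_id += 1
        self.log.append(f"create {kind} -> {obj_id}")
        return obj_id

    def request(self, obj_id, message):
        self.log.append(f"{obj_id}: {message}")


class Toolkit:
    def __init__(self, connection):
        self.windows = []
        self.attach(connection)

    def attach(self, connection):
        """(Re)build all protocol state from client-side state."""
        self.connection = connection
        self.output = connection.new_object("wl_output")            # reset outputs
        self.seat = connection.new_object("wl_seat")                # reset input
        self.data_device = connection.new_object("wl_data_device")  # reset data devices
        for win in self.windows:  # reset surfaces, then send new contents
            self._map(win)

    def add_window(self, title, width, height):
        win = Window(title, width, height)
        self.windows.append(win)
        self._map(win)
        return win

    def _map(self, win):
        win.surface_id = self.connection.new_object("wl_surface")
        self.connection.request(win.surface_id, f"set_title {win.title}")
        self.connection.request(
            win.surface_id, f"attach buffer {win.width}x{win.height}"
        )
```

In this model, calling `attach()` with a fresh `FakeConnection` produces the same request log as the original startup, which is the "like a brand new client" property described in the overview.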

## SDL: https://github.com/davidedmundson/SDL/ branch reconnect

With the above patchset, if the compositor crashes, all SDL apps will
reconnect. This was tested against everything in SDL's test directory
as well as, obviously, SuperTuxKart, the most important client on all
our machines.

This is a good example to look at to see the extent of toolkit
changes. They come in at around 200 lines, and the vast
majority of that is in the cursor code.

## Qt (Qt5 version)
https://invent.kde.org/davidedmundson/qtwayland/-/tree/reconnect_main

Qt's usage is slightly more complicated than SDL's, but the
changes are still relatively small and manageable.

I have been running this for the best part of 6 months, and everything
"just works". I've thrown games, full IDEs (KDevelop) and image
editors at this and everything works flawlessly without client
changes.

This is slightly held back by us having to be API and ABI compatible
with the frozen Qt5 base. A change landing in Qt6 could potentially
address this more cleanly, especially the part about refreshing the
QBackingStore.

Note that if running Qt on Plasma we have a "plasma-integration"
plugin that also makes some Wayland calls and has also needed
adjusting. The wiki
(https://invent.kde.org/plasma/kwin/-/wikis/Restarting) lists all
repositories and pending changes.

## Xwayland (somewhat WIP):
https://gitlab.freedesktop.org/davidedmundson/xserver/-/tree/reconnections

Xwayland was a whole new adventure, but the code for resending windows
was very simple, even when tackling details like pointer constraints.
Being Xwayland, it even retains positional information, as the X client
itself remains intact.

The biggest challenge here is that the compositor typically starts
Xwayland and does so with custom file descriptors for direct
connections. That obviously doesn't work for handling the compositor
changing. This required quite a lot of reshuffling--moving
Xwayland spawning to the wrapping helper, as well as adding new
mutexes to handle startup being announced. Ultimately everything was
doable. Messy branches for that are available on request.

## Other toolkits

One nice thing is that it's completely opt-in on the client. For some
system services (e.g. plasmashell) just restarting is a perfectly
acceptable approach. For wl-paste, it may as well just exit. Firefox
has such good in-built crash handling that I just wrapped the
executable in a tiny wrapper and it works well enough. We don't need
to have one solution fitting everything.

Ultimately the absolute worst case is that the application quits/crashes,
which is exactly what happens now.

## Handling OpenGL: https://gitlab.freedesktop.org/davidedmundson/mesa/ branch reconnect

OpenGL presented the biggest challenge. An application has a single
EGLDisplay object that persists for its whole lifespan, and the
wl_display is passed in as an argument when that EGLDisplay is created,
so we need to keep the same wl_display object across connections.

This has meant we need to change both Mesa and libwayland:

When a client reconnects after an error, a signal is emitted which is
picked up by Mesa. Any globals are then replaced, so new objects are
created against the new connection.

With these changes the rest is easy. We need to recreate the EGLWindow
against a new wl_surface on the new connection, but the context
survives completely untouched and we can render a new frame right
away.

## libwayland changes:
https://gitlab.freedesktop.org/davidedmundson/wayland/-/tree/reconnections

As mentioned above, we have a new method called "reconnect" and a
signal that is emitted to all subscribers once reconnect is called. As
this introduces new API, all the other changes rely on it.

Another big change here is the handling of dangling objects. After a
client reconnects, all proxy objects are marked as "defunct" and we no
longer send any requests on these objects, as the compositor is
unaware of these IDs. However, calling the proxy's destructor
still needs to work to free memory. By doing this inside libwayland
we are able to make it quite easy for toolkits to recreate objects
"lazily". This has proved very useful when wayland proxies are hidden
inside opaque pointers inside the toolkit, or if rendering is on
another thread.
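
A toy model of the "defunct" behaviour, in Python rather than C, may help show why it makes lazy recreation easy. The `Display` and `Proxy` classes below are hypothetical, and the generation counter is just one assumed way to mark old proxies; it is not necessarily how the libwayland branch does it.

```python
class Display:
    """Toy display: tracks the connection generation and sent requests."""

    def __init__(self):
        self.generation = 0       # bumped on every reconnect
        self.sent = []            # requests that actually went on the wire
        self.live_proxies = set()

    def reconnect(self):
        # Every proxy created before this point is now defunct.
        self.generation += 1


class Proxy:
    def __init__(self, display, name):
        self.display = display
        self.name = name
        self.generation = display.generation
        display.live_proxies.add(self)

    @property
    def defunct(self):
        return self.generation != self.display.generation

    def request(self, message):
        if self.defunct:
            # The new compositor has never heard of this ID: drop the request.
            return False
        self.display.sent.append((self.name, message))
        return True

    def destroy(self):
        # Client-side memory is always released; the compositor is only
        # told if it actually knows about the object.
        if not self.defunct:
            self.display.sent.append((self.name, "destroy"))
        self.display.live_proxies.discard(self)
```

With this shape, code on another thread that still holds an old proxy can keep calling `request()` and `destroy()` safely until the toolkit gets around to replacing it.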

# Expected questions

## Pics or it didn't happen

Here is a video of restoration working fully:
https://home.davidedmundson.co.uk/index.php/s/KbQyb3eiBodTcFm

## If the changes address only toolkits, what if I use low level wayland proxies in my apps?

The relevant change (watching for the signal on the wl_display and
recreating a registry, new globals and new objects) would need to be
redone in the client. Within KDE at least, direct use of proxies
outside the toolkit and a few select libraries is very rare.

## What happens to the clipboard?

The current selection is lost, just as if the client had closed. A
clipboard manager can alleviate this. Every new selection afterwards
behaves as expected.

## What about window size/position/stacking order?

Because everything is treated as a new window (from a compositor POV),
stacking order and positioning are effectively random. Size is often
maintained as the clients do know their original size. Session
management (https://gitlab.freedesktop.org/wayland/wayland-protocols/-/merge_requests/18)
is a possible solution to this.

## Could it be done without toolkit changes via a magic proxy?

Possibly, but I don't think it adds anything. Clients are the
canonical source of what a client needs, and know best how to handle
any sudden changes. There is also zero overhead this way.

## What if the EGLDisplay capabilities change?

My Mesa change keeps the same connection to the card specified by the
first wl_drm.device event from the first Wayland connection. If the
new compositor passes a different device, it is ignored.
That definitely has some theoretical quirks, but none that have come
up in practice for the immediate task at hand.

## What if the compositor has no support?

Then the socket on the file system will disappear and clients will
fail to reconnect and quit. We also use this in clients to detect a
graceful kwin exit and just exit accordingly.
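
As a rough illustration of that client-side check, here is a small Python sketch. The function names are hypothetical, and checking for the file up front is an assumption made for illustration; simply attempting the reconnect and handling the failure would serve the same purpose. The path lookup follows the usual libwayland convention of `WAYLAND_DISPLAY` (defaulting to `wayland-0`) inside `XDG_RUNTIME_DIR`.

```python
import os


def wayland_socket_path(env=None):
    """Return the expected socket path, or None if it cannot be worked out."""
    env = os.environ if env is None else env
    display = env.get("WAYLAND_DISPLAY", "wayland-0")
    if os.path.isabs(display):
        return display  # absolute paths are used as-is
    runtime_dir = env.get("XDG_RUNTIME_DIR")
    if not runtime_dir:
        return None
    return os.path.join(runtime_dir, display)


def action_after_disconnect(path):
    """Decide what to do once the connection to the compositor is lost.

    If the socket file is still on disk, a helper is keeping it alive and
    a new compositor is expected, so reconnect. If it is gone, treat this
    as a graceful exit (or an unsupported compositor) and quit.
    """
    if path is not None and os.path.exists(path):
        return "reconnect"
    return "quit"
```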

## What about Vulkan?

It will need similar changes to what we've done for OpenGL.

## That was too much text

There will be an XDC talk
(https://indico.freedesktop.org/event/1/contributions/20/) where I
will describe everything mentioned here.

# Next steps

The changes to libwayland are the most invasive, and potentially the
most controversial. I wanted to send an overview email and get
a discussion going before I submitted merge requests. Unfortunately
it's the blocker on everything else.