Questions about object ID lifetimes

Wed Sep 20 14:05:51 UTC 2023

On Wed, 20 Sep 2023 11:30:19 +0300
Pekka Paalanen <ppaalanen at gmail.com> wrote:

> ..
> > This might help reduce those anomalous messages and be compatible
> > with Wayland 1.  Reduce the greediness of object ID reuse by:
> > 
> > - not reusing any IDs unless at least some minimum number (256?)
> >   are free
> > 
> > - reuse the freed ones in LRU fashion  
> 
> Yeah, the free list could be a FIFO instead of a LIFO.
> 
> > There are other variations of this - the point of all being to
> > increase the time between when any ID becomes free and when it is
> > reused but without causing the ID maps to grow unreasonably large,
> > or causing their maintenance to slow down.
> > 
> > Increasing the time delay between freeing and reuse (such as with a
> > higher minimum free threshold above) would probably lead to lower
> > probability of anomalous messages. You could make this tunable
> > through an environment variable.
> > 
> > Note that the two sides don't have to agree to use this less-greedy
> > ID allocation for either side to use it - and it's really only
> > important for servers anyway.  
> 
> I'm wary of solutions that reduce the risk but do not eliminate it. If
> a protocol interface design turns out racy, it would be best to find
> that out sooner than later, and evaluate fixing it. Reproducibility
> helps analysis.
> 

Here's a very wild suggestion that would eliminate it and still
be compatible with Wayland 1.  Add a delete_id request without
modifying the existing protocol.

There are (at least) two pairs of ping/pong messages in widely
implemented protocols: xdg_wm_base (from xdg-shell) and
wl_shell_surface (from the core protocol).

From what I can tell (but I only have the wlroots code to look at),
when the client sends a pong that doesn't correspond to the most recent
ping, the server completely ignores it.  The serial arg used in pings
starts low and is incremented, and the servers tend to reset it to 0
often.  So it never climbs very high (even if it never got reset, it
would probably never exceed 2^31-1).

This means it's possible to use a specially crafted pong as a
delete_id request.

The client could send a pong with the highest bit on (so it won't
accidentally match a real serial and ack a real ping) and the low bits
indicating the object ID whose deletion it is acking.
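
A rough sketch of that encoding, in Python for brevity rather than
libwayland's C.  The names ACK_FLAG, ID_MASK, encode_delete_ack and
decode_delete_ack are made up for illustration.  Reserving 0xffffffff
follows the "I am patched" marker proposed later in this message.  Only
the low 31 bits of the ID survive, which is unambiguous only if all IDs
being acked share the same top bit; as far as I know libwayland hands
out server-allocated IDs from 0xff000000 upward, but treat that as an
assumption to verify.

```python
ACK_FLAG = 0x80000000        # highest bit of the 32-bit pong serial
ID_MASK = 0x7FFFFFFF         # low 31 bits carry the object ID
PATCHED_MARKER = 0xFFFFFFFF  # reserved: "this client is patched" (hypothetical)


def encode_delete_ack(object_id: int) -> int:
    """Build the pong serial that acks deletion of object_id."""
    if not 0 < object_id <= 0xFFFFFFFF:
        raise ValueError("object ID must be a nonzero 32-bit value")
    serial = ACK_FLAG | (object_id & ID_MASK)
    if serial == PATCHED_MARKER:
        raise ValueError("ID collides with the reserved marker serial")
    return serial


def decode_delete_ack(serial: int):
    """Return the low 31 bits of the acked ID, or None for any other pong."""
    if serial == PATCHED_MARKER or not (serial & ACK_FLAG):
        return None  # ordinary pong serial, or the capability marker
    return serial & ID_MASK
```

With these definitions, an ID that already has its top bit set encodes
to itself, and an ordinary low serial decodes to None.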

The server will, when it deletes one of its own objects, keep around at
least the type (interface) until it gets this pong.
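
The server-side bookkeeping this implies could look roughly like the
toy model below.  It is not wlroots or libwayland code: ServerIdTable
and its methods are hypothetical names, and the table is a plain
dictionary rather than the real ID map.  The point is only that a
deleted ID stays unusable, with its interface still known, until the
crafted pong arrives.

```python
class ServerIdTable:
    """Toy per-client object map with deferred ID release (illustrative)."""

    def __init__(self):
        self.live = {}     # object ID -> interface name
        self.zombies = {}  # deleted IDs still waiting for a delete_id pong

    def create(self, object_id, interface):
        if object_id in self.live or object_id in self.zombies:
            raise ValueError("ID is live or still awaiting its ack")
        self.live[object_id] = interface

    def delete(self, object_id):
        # Keep the interface so late requests naming this ID can still be
        # parsed and discarded instead of being treated as a fatal error.
        self.zombies[object_id] = self.live.pop(object_id)

    def interface_of(self, object_id):
        return self.live.get(object_id) or self.zombies.get(object_id)

    def on_pong(self, serial):
        """Return True if this pong acked a pending deletion."""
        if serial == 0xFFFFFFFF or not (serial & 0x80000000):
            return False  # capability marker or ordinary pong
        low = serial & 0x7FFFFFFF
        for object_id in list(self.zombies):
            if object_id & 0x7FFFFFFF == low:
                del self.zombies[object_id]  # ID may now be reused
                return True
        return False  # ack for an unknown ID: ignore it
```

Only after on_pong returns True for an ID does create accept that ID
again.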

There are two versions of this.  In one, clients using a patched
libwayland send the pong on their own after seeing the server-side
object deletion.  In the other, a patched server sends a ping when it
wants to reuse an ID, to force a matching pong from an unpatched
client.  The second version assumes the client won't queue the
server-side object deletion and answer the ping before processing that
deletion; if it did, it could still send anomalous messages involving
the deleted ID, so this version is risky.

If the server is patched to wait for these delete_id pongs, but the
client is not, then at best the server could fall back to using a less
greedy reuse as with my previous suggestion.  A patched client could
signal it is patched by sending an unsolicited specially crafted pong
(serial arg = 0xffffffff) early on.
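
For the fallback case, the less greedy reuse from my previous
suggestion (no reuse until a minimum number of IDs are free, and then
oldest-freed first) can be sketched as below.  LazyIdAllocator,
first_id and min_free are hypothetical names, and 256 is just the
number floated earlier in the thread, not a tested value.

```python
from collections import deque


class LazyIdAllocator:
    """FIFO free list that refuses to reuse IDs until enough are free."""

    def __init__(self, first_id, min_free=256):
        self.next_fresh = first_id  # next never-used ID
        self.min_free = min_free    # reuse threshold (tunable)
        self.free = deque()         # oldest freed ID sits at the left

    def release(self, object_id):
        self.free.append(object_id)

    def allocate(self):
        # Reuse only once the free list has reached the threshold, and
        # then take the ID that has been free the longest.
        if len(self.free) >= self.min_free:
            return self.free.popleft()
        fresh = self.next_fresh
        self.next_fresh += 1
        return fresh
```

This only lengthens the time between free and reuse; it narrows the
race but does not close it the way the acked deletion does.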

It might be nice to give users the ability to start out with an
unpatched libwayland, but if they think they are seeing clients getting
killed off due to deleted server IDs in their requests, they could
switch to using a patched "unauthorized" libwayland.  It probably
wouldn't be too hard to write a tool that parses WAYLAND_DEBUG output
to see if an issue is due to deleted server IDs.  They'd use the patched
libwayland at their own risk (but when isn't that the case?),
understanding that the "fix" is a bit of a hack.

I'm considering forking libwayland and working on one or both of these
fixes for my own use, because I don't want to implement some even
crazier things in middleware to compensate for the server ID reuse
problem.