Questions about object ID lifetimes

Mon Sep 18 15:31:18 UTC 2023

On Mon, 18 Sep 2023 14:06:51 +0300
Pekka Paalanen <ppaalanen at gmail.com> wrote:

> On Sat, 16 Sep 2023 12:18:35 -0400
> jleivent <jleivent at comcast.net> wrote:
> 
> > The easiest fix I can think of is to go full-on half duplex.
> > Meaning that each side doesn't send a single message until it has
> > fully processed all messages sent to it in the order they arrive
> > (thankfully, sockets preserve message order, else this would be
> > much harder). Have you considered half duplex?  
> 
> Never crossed my mind at least. I can't even imagine how it could be
> implemented through a socket, because both sides must be able to
> spontaneously send a message at any time.

By taking turns.  Each side would, after queuing up a batch of
messages, add an "Over!" message (from the days of half-duplex
radio communications) to the end of that queue, and then send the whole
queue (retaining its sequence).  Neither side would send a message
until it receives the other side's "Over!" message, and until the
higher levels above libwayland have had a chance to examine all
messages prior to "Over!" in order to avoid sending an inconsistent
message or even committing to a state incompatible with later messages.
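
To make the turn-taking concrete, here is a minimal sketch in Python.  It
is only a model: the Side class, the OVER marker, and the message names
are all made up for illustration, and nothing here is libwayland API.

```python
from collections import deque

OVER = "Over!"  # hypothetical end-of-turn marker, not a real Wayland message


class Side:
    """One end of a half-duplex link (hypothetical model, not libwayland)."""

    def __init__(self, name, has_turn):
        self.name = name
        self.has_turn = has_turn  # only the side holding the turn may send
        self.outbox = []          # messages queued for the next turn
        self.seen = []            # messages examined by the higher level

    def queue(self, msg):
        self.outbox.append(msg)

    def flush(self, wire):
        """Send the whole queue in order, terminated by OVER."""
        if not self.has_turn:
            raise RuntimeError(f"{self.name} must wait for the peer's Over!")
        wire.extend(self.outbox + [OVER])
        self.outbox.clear()
        self.has_turn = False

    def receive(self, wire):
        """Examine every message up to OVER before taking the turn."""
        while wire:
            msg = wire.popleft()
            if msg == OVER:
                self.has_turn = True
                return
            self.seen.append(msg)


client = Side("client", has_turn=True)
server = Side("server", has_turn=False)
to_server, to_client = deque(), deque()

client.queue("create_surface")
client.queue("commit")
client.flush(to_server)    # client's turn ends with Over!
server.receive(to_server)  # server sees both messages, then gets the turn
server.queue("configure")
server.flush(to_client)
client.receive(to_client)
```

The point of the model is that flush() refuses to run without the turn,
so neither side can send anything based on a partial view of the other
side's batch.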

> 
> > Certainly, it would mean a loss
> > of some concurrency, hence a potential performance hit.  But
> > probably not that much in this case, as most of the message
> > back-and-forth in Wayland occurs at user-interaction speeds, while
> > the speed-needing stuff happens through fd sharing and similar
> > things outside the protocol. I  
> 
> That user interaction speed can be in the order of a kilohertz, for
> gaming mice, at least in one direction. In the other direction,
> surface update rate is also unlimited, games may want to push out
> frames even if only every tenth gets displayed to reduce latency.
> Also truly tearing screen updates are being developed.

But aren't those fast frame updates done through shared fds?  If so,
they are not part of the wire protocol, and would not be impacted by
increasing the length of messages on the wire.

> 
> > think it can be made mostly backward compatible. It would probably
> > require some "all done" interaction between libwayland and higher
> > levels on each side, but that's probably (hopefully) not too hard.
> > There may even be a way to automate the "all done" interaction to
> > make this fully backward compatible, because libwayland knows when
> > there are no more messages to be processed on the wire, and it can
> > queue-up the messages on each side before placing them on the wire.
> >  It might need to do things like re-order ping/pong messages with
> > respect to the others to make sure the pinging side (compositor)
> > doesn't declare the client dead while waiting.  But that seems
> > minor, as long as all such ping/pong pairs are opaque to the
> > remainder of the protocol, hence always commute with other
> > messages.  
> 
> If you mean adding new ping/pong stuff, that doesn't sound very nice,
> because Wayland also aims to be power efficient: if truly nothing is
> happening, let the processes sleep. Anyone could still wake up any
> time, and send a message.

Not adding.  Dealing with the already existing ping/pong pairs (or any
new ones, if they are added).  Or any messages that really need to be
timely, hence can't wait for messages in front of them to be fully
processed.

That could apply to any real-time requirements, like the gaming mice
messages you mentioned above.  But doing this in general is hard unless
the messages are irrelevant to the rest of the protocol (hence commute
with everything else), like ping/pong are.

> 
> 
> On Sun, 17 Sep 2023 15:28:04 -0400
> jleivent <jleivent at comcast.net> wrote:
> 
> > Has altering the wire format to contain all the info needed for
> > unambiguous decoding of each message entirely within libwayland
> > without needing to know the object ID -> type mapping been
> > considered?  
> 
> Not that I can recall. The wire format is ABI, libwayland is not the
> only implementation of it, so that would be Wayland 2 material.

So no changes to the wire format are possible under any circumstances
in Wayland 1?

> 
> > It would make the messages longer, but this seems like it wouldn't
> > be very bad for performance because wire message transfer is roughly
> > aligned with user interaction speeds.  
> 
> We need to be able to deal with at least a few thousand messages per
> second easily.
> 
> The overhead seems a bit bad if every message would need to carry its
> signature.

Encoding more into the message is only needed if there are no
destructor request acks (the equivalent of wl_display::delete_id, but
in the opposite direction).  But I was wondering why not do it for
robustness.

The signature isn't very big, but it's probably not needed even for
robustness.  What's needed is the target object type/version
information, since from that both sides know the signature.  The issue
is just how to add robustness to the object ID -> type/version
mapping, which is the source of many problems.  The signatures are not
ambiguous after that's done.

The target type/version info could be encoded to be small if you want -
if both sides agree on an indexing scheme pairing numbers with type
names & versions, like they already do for opcodes.  It's hard to
imagine it taking more than 2 bytes (leaving you room for 2^16
type/version combos), but 4 bytes is certainly plenty.
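
A rough sketch of such an indexing scheme, in Python.  The table contents
and the encode_type/decode_type helpers are assumptions for illustration;
how the two sides would actually agree on the table is left open.

```python
import struct

# Hypothetical table both sides would agree on; the entries are examples only.
TYPE_TABLE = [
    ("wl_surface", 4),
    ("wl_surface", 5),
    ("wl_buffer", 1),
    ("wl_callback", 1),
]
INDEX_OF = {entry: i for i, entry in enumerate(TYPE_TABLE)}


def encode_type(interface, version):
    """Pack the agreed index into 2 bytes (room for 2**16 combinations)."""
    return struct.pack("=H", INDEX_OF[(interface, version)])


def decode_type(data):
    """Recover (interface, version) from the 2-byte index."""
    (index,) = struct.unpack("=H", data)
    return TYPE_TABLE[index]


tag = encode_type("wl_buffer", 1)
```

With the type/version recovered from the tag, the receiver can look up
the signature for the opcode without consulting its object ID map.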

If you don't have destructor request acks, you may also need
"generation" numbers. There's a problem about object IDs we haven't
spoken about yet, related mostly to the hard case with server-side IDs
and server-side destructor events.  It's possible (I don't know if there
are instances in the current protocol) to have a series of object
creations (new_id args) and destructions, all originating from the
server and all involving the same object ID, after which the server
receives a request from the client involving that object ID.  The
decoding of that request can agree with the server about the type of
the object, but not about the generation of that object.  These steps:

server sends:
  event E1 with new_id N for type T  (call the new object O1)
  ...
  destructor for id N
  ...
  event E2 with new_id N for type T  (call the new object O2)

at some later point, the server receives:
  request R involving id N (either as target or object arg) which
  matches type T based on decoding the request

The server cannot tell, even if it has all of the decode info available
(whether encoded in the wire, or because the request R gets decoded
without issue otherwise), whether the object with id N in R refers to O1
or O2.  If it refers to O1, the message is what I called "speculative"
because it was sent prior to the client seeing the destructor for id N
above.  That message could be a mistake if processed, because the
current state of the compositor has O2 with id N, and that might be a
completely distinct usage of object type T that isn't at all related to
request R.

A generation number would keep track of which particular instance
(generation) of the object with id N and type T a message is referring
to.  A generation number would be associated with each id N, and would
get incremented each time N got re-used.  It would be added to the wire
protocol for both the target objects and the object arguments.  That's
more (probably 4-byte) fields.
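
Here is a small Python sketch of the O1/O2 steps above with a generation
check added.  The ServerObjects class and its method names are invented
for illustration; this is not how libwayland stores objects.

```python
class ServerObjects:
    """Hypothetical server-side ID table with a generation per ID."""

    def __init__(self):
        self.live = {}        # id -> (type, generation)
        self.generation = {}  # id -> generation most recently assigned

    def create(self, obj_id, obj_type):
        gen = self.generation.get(obj_id, 0) + 1
        self.generation[obj_id] = gen
        self.live[obj_id] = (obj_type, gen)
        return gen

    def destroy(self, obj_id):
        del self.live[obj_id]

    def accept(self, obj_id, obj_type, gen):
        """True only if the request names the current instance."""
        return self.live.get(obj_id) == (obj_type, gen)


server = ServerObjects()
N = 0xFF000000              # an ID in the server-allocated range
o1 = server.create(N, "T")  # event E1: new_id N, object O1
server.destroy(N)           # destructor event for id N
o2 = server.create(N, "T")  # event E2: new_id N, object O2

# Request R was sent before the client saw the destructor, so it carries
# O1's generation.  ID and type match, but the generation does not.
stale = server.accept(N, "T", o1)
fresh = server.accept(N, "T", o2)
```

Without the generation argument, both calls to accept() would look
identical to the server, which is exactly the ambiguity.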

Again, note that this ambiguity between O1 and O2 cannot occur if
there is some destructor ack request that the client can send and that
libwayland on the server side understands (the equivalent of
wl_display::delete_id, but going the other way). I suspect that's
better to add to the protocol instead of generation numbers.  It
also cannot occur in the half-duplex scheme, which is a more complete
solution to many problems.
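
A sketch of the destructor ack idea, again in Python and again purely
hypothetical (AckedIds and ack_destroy are made-up names, not an existing
request): the server refuses to re-use an ID until the client has
acknowledged its destruction.

```python
class AckedIds:
    """Hypothetical server-side allocator that waits for a client ack."""

    def __init__(self):
        self.live = {}        # id -> type
        self.unacked = set()  # destroyed, but not yet acked by the client

    def create(self, obj_id, obj_type):
        if obj_id in self.live or obj_id in self.unacked:
            raise ValueError("id not yet free for re-use")
        self.live[obj_id] = obj_type

    def destroy(self, obj_id):
        del self.live[obj_id]
        self.unacked.add(obj_id)

    def ack_destroy(self, obj_id):
        """The client confirms it has seen the destructor for obj_id."""
        self.unacked.discard(obj_id)

    def is_stale(self, obj_id):
        """A request naming an unacked id predates the client seeing the destructor."""
        return obj_id in self.unacked


ids = AckedIds()
ids.create(7, "T")
ids.destroy(7)
stale_before_ack = ids.is_stale(7)  # requests for 7 are known to be old
try:
    ids.create(7, "T")              # re-use is blocked until the ack
    reuse_blocked = False
except ValueError:
    reuse_blocked = True
ids.ack_destroy(7)
ids.create(7, "T")                  # now O2 can safely take the ID
```

Since the ID is never re-used while a stale request could still be in
flight, O1 and O2 can't be confused, and no extra field is needed on
every message.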

I read the Protocol-next issues - so I see you've thought about
delete_id requests already.

> 
> > Also, for any compositor/client pair, as long as they both use the
> > same version of libwayland, the necessary wire format change would
> > not result in compatibility issues.  It would for static linked
> > cases, or similar mismatching cases (flatpak, appimage, snap, etc.
> > unless the host version is mapped in instead of the packaged one
> > somehow). There also seem to be unused bits in the existing wire
> > format so that one could detect a compositor/client
> > incompatibility at least on one end.  
> 
> We've never had the requirement for compositor and clients to use the
> same minor version of libwayland. There are also completely
> independent Wayland implementations in other languages that expect to
> be interoperable. Breaking all that seems unacceptable.
> 
> What unused bits did you find?

For instance, the length field in the header.  It seems unlikely that
you need all 16 bits.  And, currently, you have hard-coded 4K ring
buffers in libwayland, so you have a hard limit of 4K (12 bits) for max
message size. Unless I'm missing some code to handle that somewhere.  I
think there is a check such that if a message in the wire claims to
have a length too big for your 4K ring buffer, that's an error.
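
To show which bits I mean, here is a Python sketch of the header layout:
two 32-bit words, the object ID and then a word holding the size in the
upper 16 bits and the opcode in the lower 16.  The helper names are mine,
and the "spare bits" reading assumes every message is strictly smaller
than 4096 bytes.

```python
import struct

RING_BUFFER_SIZE = 4096  # the 4K limit described above


def pack_header(object_id, opcode, size):
    """Two native-endian 32-bit words: object ID, then size << 16 | opcode."""
    return struct.pack("=II", object_id, (size << 16) | opcode)


def unpack_header(data):
    """Return (object_id, opcode, size) from the first 8 bytes."""
    object_id, word = struct.unpack("=II", data[:8])
    return object_id, word & 0xFFFF, word >> 16


def spare_bits(size):
    """Top 4 bits of the 16-bit size field."""
    return size >> 12


header = pack_header(object_id=3, opcode=1, size=12)
largest = RING_BUFFER_SIZE - 1
```

For any size below 4096, spare_bits() is zero, so those four bits could
carry a format marker that an old peer would reject as an oversized
message.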

> 
> > I'm not suggesting that unambiguous decoding of all messages is a
> > sufficient fix, but it is a necessary one.  There are still
> > speculative computation issues that it wouldn't resolve alone.  
> 
> I didn't understand what is speculative. There is no roll-back of any
> kind on anything, what's computed is final.

Right - you have speculation without roll-back, meaning you have broken
speculation!

One case of speculation we discussed already is a request containing
new_ID args, where the target object has already been deleted by the
server.  That request was speculative, since it was sent by the client
prior to it seeing the already-processed-on-the-server destruction of
the target object.  The client "speculated" that the target object will
still exist to receive the request, and did so incorrectly.

Although this type of speculation is addressed (partially) by
the addition of a delete_id request, at least with respect to
disambiguating message decoding, other cases are not. My reason for
suggesting going half-duplex was about trying to prevent speculation
AND message ambiguity together with one fix.

There may be cases of speculation unrelated to object destruction.
Consider:

Side A sends message M1 then message M2 (or at least did other
state-change things after sending M1).

Side B processes M1 (but doesn't yet see M2) and sends message M3 which
is not compatible with what side A did after sending M1.  But side A
doesn't know based only on its own state that it should ignore message
M3.  And even if it does ignore M3, side B may have done processing
after sending M3 that changed its state, such that its new state
is incompatible with side A ignoring message M3.
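
The M1/M2/M3 scenario can be played out in a few lines of Python.  The
"mode" state and the message tuples are invented for illustration and do
not correspond to any real Wayland interface.

```python
from collections import deque

a_to_b, b_to_a = deque(), deque()

# Side A sends M1, then changes state again and sends M2.
a_state = {"mode": 1}
a_to_b.append(("M1", "mode", 1))
a_state["mode"] = 2
a_to_b.append(("M2", "mode", 2))

# Side B has processed only M1 when it decides to act.
b_state = {"peer_mode": None, "assumes_m3_applied": False}
_, _, value = a_to_b.popleft()
b_state["peer_mode"] = value
b_to_a.append(("M3", "needs_mode", value))  # only valid while A is in mode 1
b_state["assumes_m3_applied"] = True        # B commits to M3 having worked

# Side A receives M3 after it has already moved on to mode 2.
_, _, needed = b_to_a.popleft()
m3_applicable = needed == a_state["mode"]
```

At the end, A cannot apply M3 while B's state assumes that it was
applied, and M2 is still sitting unread in B's queue.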

tl;dr: protocol asynchrony leads to speculation that can result in the
two sides disagreeing about the correct state of the world.  

BTW - I don't recommend adding rollbacks!  It's a huge can of worms
unless the speculation is very limited (as with machine instruction
speculation in modern CPUs).

You can prevent speculation for each individual case within the protocol
by having the protocol include acks, mutexes or transactional boundary
messages specific to these cases.  Or you could design the protocol so
that such M3 messages (and other sending-side state changes) must
always be compatible with (commute with) the receiving side's state - that's
probably very hard.

Or, you could go half-duplex, and require that neither side "commits"
to any state changes, including sending messages, until it has processed
ALL messages prior to the "Over!" message that was sent by the other
side. That prevents speculation by taking all client-vs.-server
asynchrony out of the protocol.  This half-duplex idea, as draconian
as it sounds, has one key point in its favor: it's the easiest to get
right.

Or, you can hope that cases of speculation don't arise.  Or that when
they do, the resulting mistakes aren't too severe.