Questions about object ID lifetimes

Fri Sep 15 12:18:58 UTC 2023

On Thu, 14 Sep 2023 15:10:48 -0400
jleivent <jleivent at comcast.net> wrote:

> On Thu, 14 Sep 2023 16:32:06 +0300
> Pekka Paalanen <ppaalanen at gmail.com> wrote:
> 
> > 
> > As an aside, we collect unfixable issues under
> > https://gitlab.freedesktop.org/wayland/wayland/-/issues/?label_name%5B%5D=Protocol-next
> > These are issues that are either impossible or very difficult or
> > annoying to fix while keeping backward compatibility with both servers
> > and clients.  
> 
> Only 7 of them?

Some of them are bags of issues.

Feel free to collect your ideas in new issues though. You may find
interested people.

> > ------
> > 
> > Object ID re-use is what I would call "aggressive": in the libwayland
> > C implementation, the object ID last freed is the first one to be
> > allocated next. There are two separate allocation ranges each with its
> > own free list: server and client allocated IDs.  
> 
> After I sent the initial post, I realized that the two separate
> ID ranges help in the following way:
> 
> For any object ID in the allocation range of side A, a destructor
> message from side B does not need acknowledgement.  This is because B
> can't introduce a new object bound to that ID, only A can.  Hence, any
> new_id arg for that ID is an acknowledgement of the destruction.

That's an interesting interpretation. Perhaps it works for your case,
but I would not use in regular clients and servers, because I'd like to
be able to catch ID re-use errors like libwayland does.

> However, B has to be careful to ignore messages containing that ID
> until it sees one with the ID as a new_id arg.  After the destructor
> message from B but before a subsequent new_id for that ID from A, B
> should not use the ID as arguments to other messages (and attempts to
> do so can be dropped).  And this can be automated provided the
> destructor tag can be relied on.
> 
> > 
> > The C implementation also poses an additional restriction: a new ID
> > can be at most the largest ever allocated ID + 1.
> > 
> > All this is to keep the ID map as compact as possible without a hash
> > table. These details are in the implementation of the private 'struct
> > wl_map' in libwayland.  
> 
> Obviouly, that helps middleware as well, for the same reasons.  It also
> makes more automatic error detection possible.
> 
> > ...
> > 
> > Your whole above analysis is completely correct!  
> 
> I was rather hoping things would turn out less complex than they
> seemed...
> 
> >   
> > > However, the other cases are not as easy to identify.
> > > 
> > > The other cases are:
> > > 1. an object created by a client request that has destructor events
> > > 2. an object created by the compositor
> > > 
> > > It might be true that case 1 does not exist.  Is there a general
> > > rule against that such cases would never be considered in future
> > > expansions of the Wayland protocol?    
> > 
> > Destructor events do exist. Tagging them as such in the XML was not
> > done from the beginning though, it was added later in a
> > backward-compatible manner which makes the tag more informational than
> > something libwayland could automatically process. The foremost example
> > is wl_callback.done event. This is only safe because it is guaranteed
> > that the client cannot be sending a request on wl_callback at the same
> > time the server is sending 'done' and destroying the object:
> > wl_callback has no requests defined at all.  
> 
> Fortunately, my point above about the advantage of the separate ID
> ranges helps here.  If wl_callback is created by the client, then a
> wl_callback.done event tagged as a destructor does not need
> acknowledgement AND is always safe provided that messages involving the
> wl_callback ID (other than it's eventual reuse as a new_id arg) are
> ignored above libwayland.

I think there is another asymmetry here. libwayland-client definitely
ignores events on an object ID whose wl_proxy has been destroyed from
the API user point of view. libwayland-server though seems to be
throwing a protocol error immediately on any non-existing object ID
receiving a message.

I think there is a case that requires the delete_id event and cannot
work solely on the new_id ID re-use rule:
- client sends request to create wl_callback
- client destroys the wl_callback (no request, enters zombie state client side)
- client creates some other new object X

When the server destroys the wl_callback, it sends the done event, and
delete_id event, and cannot know if the client has already destroyed
the wl_callback or not.

Zombie IDs are not eligible for re-use. The zombie is destroyed and the
ID freed on delete_id event. If the client has not seen the delete_id
yet, it allocates a new high ID for object X. Otherwise, it re-uses the
wl_callback's old ID for X.

However, there is no delete_id in the opposite direction, meaning that
a similar situation in the opposite direction is racy and can lead to
confusing object IDs. I guess this is what you already found out.

Zombies are actually what allows libwayland-client to ignore received
events, as the events still need to parsed enough to be able to extract
and close all fds they might carry. Without a zombie, an event to a
non-existing ID is a protocol error.

> But again, this means the destructor tag is important and not merely
> informational.
> 
> I did notice that the destructor tagging was added mostly (or
> solely) to help with code generation by wayland-scanner implementations
> in programming languages where destructors require some specific
> syntactic notation.

Because other language bindings depend on destructor tagging on events,
that is a fairly good insurance that the tagging is correct.

> But maybe destructor tagging is even better than that?  Maybe it would
> allow libwayland to automate more in a more robust way AND also allow
> for middleware that doesn't have to simulate all of the semantic level
> interactions induced by protocol messages in order to merely keep track
> of how to decode messages.

Libwayland needs to be backward compatible. Therefore it has to handle
interface definitions that lack destructor event tagging. Since the
interface definitions are built into binaries, you could even have
mismatching tagging between server and client, yet they still need to
work.

This is a concern for other language bindings as well, because they
probably want to work even if the other side is using libwayland.

> > It also requires that nothing passes an existing wl_callback object as
> > an argument in any request. We have been merely lucky that no-one has
> > done that. It's really hard to imagine a use case where you would want
> > to pass an existing wl_callback to anything.  
> 
> Again, the above separate ID ranges point addresses this, I think.

I don't think so. Libwayland seems very aggressive in erroring out for
unknown object IDs. Only zombies are allowed to be ignored, and
server-side does not use zombies as it also has no delete_id equivalent
request.

> > 
> > Extensions may have similar objects that only deliver some one-off
> > events and then "self-destruct" by the final event. All this is simply
> > documented and not marked in the XML.  
> 
> That's what I was hoping to avoid.  If there are object types where
> object lifetime can only be understood by simulating all of the
> relevant semantic content of the messages involved, then that's not
> good for middleware.  Isn't it also problematic towards the goals of
> libwayland, because it makes it impossible for libwayland to ensure
> that messages are properly decoded without trusting that the client
> and/or compositor have implemented everything properly?  Because
> demarshaling a message incorrectly seems like it would be bad news.

Indeed.

Well, libwayland does get the message signatures, so it can verify
everything is demarshalled correctly. What it cannot verify though is
that wl_resources are destroyed exactly at the right time. It's up to
user code in the server to call wl_resource_destroy() at the right
message.

This is again asymmetric with client side, where generated language
bindings for requests need to signal also destruction. Otherwise they
will have multithreading races.

Server side libwayland API is simply not thread safe, while client side
API attempts to allow it.

> > The asymmetry in the fundamental protocol bookkeeping messages
> > (wl_display.delete_id) is unfortunate in hindsight. Client created and
> > client destroyed objects are best supported, anything else will
> > require very careful messaging design to avoid a race pitfall. We
> > have a proper tear-down sequence for a client destroying an object,
> > but we do not have a similar sequence for the server destroying an
> > object, where the client would ack the destruction before the object
> > ID is re-used.  
> 
> If you add one, it would be compatible with giving importance to
> destructor tagging if done as follows:  The initial even proposing the
> removal of the object is NOT tagged as a destructor, but the
> acknowledgement from the client IS tagged as a destructor.  That would
> become a case like the ones above where a destructor is issued by the
> side opposite the one owning the object ID range: such destructors can
> be the final message and don't themselves need further acknowledgement.

I'm not sure we can add one. It would be quite a contortion, unless you
mean the hypothetical Wayland 2.

> > > For objects created by the compositor, there are 2 subcases:
> > > 
> > > 2a. objects with only destructor events
> > > 2b. objects with destructor requests
> > > 
> > > Again, it might be the case that 2b does not exist, as it is
> > > analogous to case 1 above.  But, is there a general rule against
> > > such future cases as well?  Combining 1 and 2b, is there a general
> > > rule that says that only the object creator can initiate an
> > > object's destruction (unprovoked by the other side of the
> > > protocol)?    
> > 
> > There is no such rule, but nowadays the recommendation is to keep to
> > client created and client destroyed objects unless there are pressing
> > reasons to design something more delicate/fragile.
> > 
> > 2b exists as wl_data_offer, created by wl_data_device.data_offer
> > event.  
> 
> Again, saved by my observation above about the safety added by separate
> object ID ranges for clients vs. compositor.
> 
> > 
> > I'm not aware of 2a existing in the wild, but they could I guess.
> > 
> > Generally we try to avoid new_id in events, because it causes a race
> > by default: what if the server is sending a new_id event at the same
> > time the client destroys the originating object? The result is a
> > mismatch in object ID tracking between server and client, causing
> > unexpected failures perhaps much later.  
> 
> This can occur with the sides swapped, couldn't it?  If a new_id arg
> appears in a message such that the destination side for that message can
> initiate destruction of the message's target object, then if things are
> not handled carefully, the new_id leaks (isn't ever reused).  To
> prevent this leak, you either cannot allow that side to initiate
> destruction of the target object, or you have to keep at least the type
> information for the target object around until the destruction is
> acknowledged.  If you did keep the type information, you could continue
> to decode the messages and send something like delete_id responses for
> each such new_id.

This problem is avoided by knowing the race exists and designing
accordingly: if an interface has any requests at all, it must not have
a destructor event.

> > 
> > When a client (or server) calls the respective API to destroy a
> > protocol object, libwayland guarantees that the object's
> > listeners/callbacks won't be called even if messages are still
> > received on that object until it is properly torn down. That makes
> > the user code easier to write, but it also throws the afterwards
> > received messages away. If that message was to create a new protocol
> > object, that won't happen, the ID won't be allocated. This causes the
> > mismatch.  
> 
> [I noticed the use of "zombies" in the client, but no parallel in the
> server.  Also it seemed from a cursory analysis of the client code that
> the zombies are mostly to take care of cases that arise in multithreaded
> clients.  Is that correct?  Or are zombies related to this issue
> instead?]

I don't think zombies are related to multithreaded clients really, they
are mostly for the server/client asynchronicity.

> There is a domino effect as well: Messages involving that object as an
> argument that also have a different new_id argument have to be involved
> in the above processing as well, with delete_id sent for those new_id
> arguments.  And so on for messages involving those new_id arguments.

If zombie is an argument, then just that argument is set to null, but
the event is still delivered. Yes, it does mean that even if an event
argument is non-nullable, a client might actually see it as NULL with
the libwayland C bindings. Non-nullable applies only on sending, not
receiving.

> > 
> > Libwayland cannot automatically destroy objects either, because it has
> > no information on how to do that properly and safely. The XML does not
> > tell for sure.  
> 
> Is it a goal for libwayland to be able to at least manage object IDs
> properly on its own?  Because that would align with the needs of
> middleware.

It would have been a good goal to have...

> Also, I think it might not be too late (in backward compatibility
> terms) to do something about this.

Probably too late indeed.

> > ...
> >   
> > > 
> > > A. Ownership vs. destruction: An object created by a client can only
> > > have destructor requests, and an object created by the compositor
> > > can only have destructor events.    
> > 
> > False, violated in the wild. Examples: wl_callback created by client,
> > destroyed by server; wl_data_offer created by server, destroyed by
> > client.  
> 
> Again, that case addressed by the observation about separate ID ranges
> way above.
> 
> >   
> > > B. All destructor requests are followed by wl_display::delete_id
> > > events as acknowledgements    
> > 
> > True, as this is built in to libwayland.
> >   
> > > C. All destructor events are themselves acknowledgements of
> > > (implicit) destruction requests.    
> > 
> > Close enough I suppose, assuming the events are correctly tagged as
> > destructors. As mentioned, the tag is heavily recommended but not
> > technically mandatory for the libwayland C bindings, because it was
> > introduced several years too late.  
> 
> If the destructor tag is necessary for other language bindings, that's
> sufficient for my purposes, I think.  I can make sure I'm
> parsing protocol XML files that will work for languages that require
> destructor notation.
> 
> But, if it isn't "technically mandatory", doesn't that risk the
> introduction of protocol for which wayland-scanner does not generate
> proper code for a language that requires destructor notation?

Wayland-scanner is not used by other language bindings. They use their
own generators from XML, and they even tend to use a different
low-level libwayland ABI than what wayland-scanner targets. Using
libffi and varargs for the original C bindings turned out to be painful
for any other language.

> >   
> > > D. No object can be the target or argument of a message issued by
> > > one side after that side has issued a destructor request (explicit
> > > or implicit).    
> > 
> > Correct, I believe. An actual destructor request should be enforced
> > like that in libwayland. Allowing otherwise would defeat the point of
> > a destructor request or a non-destructor request intended to trigger a
> > destructor event. Nothing enforces the latter though.
> >   
> > > Is this correct?  Are these actual requirements that are enforced
> > > for the current protocol and future expansions?
> > > 
> > > Can there be cases of objects created by the compositor, where the
> > > compositor proposes their destruction without any prior (implicit or
> > > explicit) request to do so from the client?  If so, how are these    
> > 
> > I don't know of anything forbidding them.  
> 
> This again is the case where, as long as the ack from the client is
> what is tagged as being the "destructor", all is still OK.
> 
> > Thanks,
> > pq  
> 
> I am still hopefull, especially after realizing the benefits of the
> separate ID ranges.  The only case this leaves uncovered is object IDs
> in the compositor ID range where the compositor wants to initiate
> destruction of that object.  And I believe that case can be handled
> automatically as well, possibly even in a backward-compatible way.   The
> domimo effect case with chains of new_ids is complex, but I think there
> is a way to handle it at least semi-automatically.
> 
> But, only if destructor tags (or something similar) can be relied on in
> future versions and extensions to the wayland protocol.
> 
> My interests are in the possibility of robust middleware, but I
> think there is a convergence between three areas such that the same
> hopefully backward compatible additions yield positive results to all.
> These are: middleware, wayland-scanner bindings generation for
> languages with destructor notations, and completely managing the "meta"
> aspects of the protocol completely within libwayland.  These meta
> aspects are about proper decoding (demarhsalling), proper reuse of
> object IDs and preventing ID leakage, and the prevention of bugs
> and/or excessive complexity that might be induced by attempts to manage
> these things above libwayland. And the ability to automatically check
> for errors in these cases and prevent their spread.

This might be food for Wayland 2, or at least a libwayland2 with a
completely new scanner. And let's ditch varargs and libffi while at it!

Sorry, that was a rhetorical "us". I don't think I can personally
invest effort there at this time.

I should also mention community interest in reimplementing libwayland
in Rust. I believe upstream would be happy to sun-set the C
implementation of libwayland if there was a drop-in replacement written
in Rust, wrapped in something that offers the same libwayland C ABI.
Attempting to offer that C ABI might make developers writing Rust very
sad, I'm not sure anyone seriously tried yet.

Thanks,
pq
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/wayland-devel/attachments/20230915/f4c346fa/attachment.sig>