additional race in the wayland protocols?

Mon Mar 25 10:21:59 UTC 2019

On Sun, 24 Mar 2019 22:26:57 +0100
Eugen Friedrich <friedrix at gmail.com> wrote:

> Hi all,
> 
> There is race condition which produces the error:
> 
> not a valid new object id (4278190081), message data_offer(n)
> 
> use-case:
> 1. mutli-threaded client: several threads dispatching and flushing the
> wayland connection
> 2. server-side objects are used in the protocol (e.g.
> linux-dmabuf-unstable-v1.xml used for buffers)
> 3. server-side objects are destroyed and created quite frequently (not
> mandatory for the error but will increase the reproducibility).
> 
> To understand why the error occurs we just need to take a look to the
> generated code of the buffer destroy event:
> 
> static inline void
> wl_buffer_destroy(struct wl_buffer *wl_buffer)
> {
>         wl_proxy_marshal((struct wl_proxy *) wl_buffer,
>                 WL_BUFFER_DESTROY);
>         wl_proxy_destroy((struct wl_proxy *) wl_buffer);
> }
> 
> if the thread(T1) executing the wl_buffer_destroy call will be
> interrupted due to the scheduling between the wl_proxy_marshal call
> and wl_proxy_destroy calls and another thread(T2) will dispatch the
> connect, the buffer destroy event will be send to the compositor and
> if also request to create a new buffer will be in the queue
> server could just reuse the destroyed object id wayland ,
> Then dispatching of the event with new object id on the client site
> needs to happen before T1 had change to run. in this case we will see
> that the slot in the server map object on the cleint side is still not
> NULL because wl_proxy_destroy call is still not executed.
> 
> This race is quite rare but we faced this in the real project.

Hi Eugen,

yes, all that sounds plausible. Thank you for tracking that down, it
must have been nasty.

> To fix this we could create the proxy wrapper first from the object to
> be destroyed change the order of the wl_proxy_destroy and
> wl_proxy_marshal calls where wl_proxy_marshal will use the wrapper
> object and then destroy the wrapper object.
> 
> What do you think?

Sounds like a hack kludge, and I'm not even sure it would be legal.

My immediate thought would be to extend libwayland-client API with a
new family of functions wl_proxy_marshal_destroy() which would combine
wl_proxy_marshal() and wl_proxy_destroy() into the same function call
that can then hold the necessary locks over both operations.

wayland-scanner's code generation need to be changed and all client
side application code recompiled in any case, so the hack would have
the same cost while being an awkward solution if even possible.

A similar reason caused the introduction of wl_proxy_marshal_constructor():
https://gitlab.freedesktop.org/wayland/wayland/commit/853c24e6998f747150e4233cf41bfa8268964cc2

In hindsight, it is kind of obvious that we would need the same with
destructors too.

I would recommend to file a Wayland issue in Gitlab for this, so we can
track it.

I would also like to see if we can forbid requests that both destroy
the protocol object and create another one in one message. This would
avoid the need for wl_proxy_marshal_constructor_destroy(). Marshalling
is already getting a little bit combinatorial with constructor,
versioned, and array.

Thanks,
pq
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/wayland-devel/attachments/20190325/d6465cbe/attachment.sig>