Wayland Message Behaviour

Thu Nov 9 08:29:13 UTC 2017

On Wed, 8 Nov 2017 16:02:11 +0000
"Turner, LewisX" <lewisx.turner at intel.com> wrote:

> Hi all,
> 
> I have come across some strange behavior with Wayland and I wondered
> if you could share your thoughts on it.
> 
> The trace below is a sanitized version of what I captured of the
> behavior of Wayland messages between our client and Weston. Towards
> the end of the trace, we see an encoded buffer handle message sent
> out once but for some reason the client ends up receiving this
> message twice. At this point, the bridge between Weston and the
> client is broken and although both can send messages to each other,
> the messages themselves are never getting to their destinations.
> 
> We only see this after calling wl_client_flush() within Weston
> immediately after sending a message to a client. On top of this, the
> client can run at 60FPS for anywhere between 2-30 seconds before
> encountering it. The wl_client_flush() is called immediately after
> the call to encoded_buffer_handle() on every frame. There is no
> obvious difference between the frame on which the issue occurs and
> the other frames that work fine.
> 

> [472508.841]  -> hmi at 3.encoded_buffer_handle(10, 2073600, 409148, 8, 3559793324)
> [472509.269] hmi @3.encoded_buffer_handle(10, 2073600, 409148, 8, 3559793324)
> [472517.952]  -> hmi @3.release_buffer_handle(8, 0, 1)
> [472518.146] hmi @3.release_buffer_handle(8, 0, 1)
> [473005.451]  -> hmi @3.encoded_buffer_handle(10, 2073600, 427086, 8, 3559838348)
> [473005.848] hmi @3.encoded_buffer_handle(10, 2073600, 427086, 8, 3559838348)
> [473014.914]  -> hmi @3.release_buffer_handle(8, 0, 1)
> [473015.112] hmi @3.release_buffer_handle(8, 0, 1)
> [473505.149]  -> hmi @3.encoded_buffer_handle(10, 2073600, 558179, 8, 3559883345)
> [473506.794] hmi @3.encoded_buffer_handle(10, 2073600, 558179, 8, 3559883345)
> [473517.277]  -> hmi @3.release_buffer_handle(8, 0, 1)
> [473517.427] hmi @3.release_buffer_handle(8, 0, 1)
> [473517.524] hmi @3.encoded_buffer_handle(10, 2073600, 558179, 8, 3559883345)
> [474002.087]  -> hmi @3.encoded_buffer_handle(10, 2073600, 400377, 8, 3559928007)
> [474507.144]  -> hmi @3.encoded_buffer_handle(7, 2073600, 391711, 4, 3559972997)
> [475002.930]  -> hmi @3.encoded_buffer_handle(11, 2073600, 399621, 5, 3560018014)
> [475503.604]  -> hmi @3.encoded_buffer_handle(12, 2073600, 408610, 6, 3560063010)
> [476002.867]  -> hmi @3.encoded_buffer_handle(13, 2073600, 422290, 7, 3560107997)
> 

Hi,

ok, I see the duplicated message. It could be a bug in libwayland.

How exactly did you get this trace? It's not obvious which one is client
and which one is server as I don't know your protocol extension.

Have you tried a middle-man tool to print the protocol traces?
See https://wayland.freedesktop.org/extras.html and please read the "is
not enough" link for the caveats of WAYLAND_DEBUG dumping.

Explicitly calling wl_client_flush() should never be necessary in
Weston, because wl_display_run() does it internally when control returns
to the main event loop. Maybe you are not returning the main event loop
often enough? Or is the server side using wl_resource_queue_event()?
But that is also irrelevant to the message duplication.

> Also, when this happens it's worth nothing that
> wl_display_roundtrip() reports an error, and wl_display_get_error()
> returns an error code. The documentation online suggests errors here
> are fatal and we cannot use the display again.

Yes, it is fatal. Have you hooked up the client side logging from
libwayland? That should print the error, but the error should also be
present in the protocol dump if it is a protocol error. But it might be
a socket error as well.

> Also note that commenting out the call to wl_client_flush() makes the
> issue go away. However, we then see an unacceptable delay before the
> event is received on the client side.

That seems like an unrelated issue worth looking into as well. Weston
should be quite fast in returning to its main event loop.

Are you sure you are not calling wl_client_flush() on a destroyed
wl_client, btw?

If you ever store a wl_client pointer anywhere, you must also register
a destroy listener for it, because libwayland-server will automatically
destroy clients, e.g. due to protocol or socket errors.

> The main question at this point is whether anyone else has seen
> similar behavior. We do not want to spend more time investigating
> this if the cause is a known issue.

No, I have never seen a duplicated message before.

Maybe it's triggered by a failure path, since you say the connection
fails.

A common way to cause arbitrary connection failures is a client (or
server) that never throttles. A client that continuously spams requests
without waiting for a single event from the compositor in between is
likely to overflow the message buffers, causing a fatal error and
disconnection. You will not know about the error until you call a
client function that can return it, e.g. roundtrip, dispatch, etc.
Request sending functions never return errors even if they raise one,
because it would be far too much for the caller to check every request
call.

Now, if anything I said was news to you, I would be happy to hear
suggestions on where you would have expected the thing to have been
documented. I know our documentation is somewhat sparse.

Thanks,
pq
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/wayland-devel/attachments/20171109/7dff8944/attachment.sig>