weird Xwayland and compositor deadlock issue [WAS: [PATCH xserver v2] xwayland: handle EAGAIN and EINTR gracefully]

Tue Sep 13 16:04:14 UTC 2016

Hi Pekka,

----- Original Message -----
> Hi Olivier,
> 
> I don't have any solution for you. The interactions between the Wayland
> compositor and Xwayland are known to be very easily deadlockable IIRC. I
> believe the only thing you can do is ensure no such case can ever
> occur, which is very painful. That is, never do a blocking roundtrip at
> least from one side.
> 
> Have the recent modifications caused a significant increase of Wayland
> requests from Xwayland? If Xwayland needs to send an amount of data
> bigger than bufferable, *any* blocking roundtrip via X11 from the
> Wayland compositor is prone to deadlock. It will be waiting for a reply
> via X11, while Xwayland is blocked on flushing, since the Wayland
> compositor is not consuming requests.
> 
> It can also trivially happen if both sides do a blocking roundtrip at
> the same time. Or just a wait for an event.
> 
> Either server needs to be able to return to its main loop to process the
> protocol stream it is the server for. Preferably both, I think.

Unfortunately, any XSync() (for example, the one called in gdk_error_trap_pop() in GDK) issues a blocking roundtrip, and window managers tend to do that quite a lot (some more than others), so I don't think we can easily change window managers to avoid blocking roundtrips on the X11 side.

> You could check how Weston's XWM works. I highly suspect that after
> Xwayland launch it avoids doing any blocking roundtrips via X11.

Yet some X calls are inherently blocking, e.g. XShapeGetRectangles() or even XGetWindowAttributes(), which mutter invokes each time a new window is mapped. mutter still uses Xlib, not xcb.

> I'd assume Xwayland also tries to avoid blocking on Wayland events, but
> if nothing else, I believe Mesa via GLAMOR may block on
> wl_buffer.release events... or maybe not if GLAMOR is smart with its
> throttling. Anyway, since your flush is hitting EAGAIN, that doesn't
> seem to be the cause.
> 
> I wonder if making wl_display_flush() block immediately like in your
> patch could be replaced by adding the wl_display fd to the main poll
> loop, so that it would get flushed ASAP but still service X11 requests
> in the mean time? It does run the risk of overflowing the Wayland send
> buffer in Xwayland. Any way to prioritize the Wayland compositor's X11
> connection in Xwayland?

If I don't treat EAGAIN as a FatalError() and instead wait for the Wayland display file descriptor to become writable again, Xwayland eventually dies with another error, "(EE) request could not be marshaled: can't send file descriptor", which comes directly from libwayland (in copy_fds_to_connection()).

So I am at a loss...

Cheers,
Olivier