[Xcb] Xlib/XCB in multi-threaded situation results in deadlock

Thu Mar 17 14:25:22 PDT 2011

On Tue, Mar 15, 2011 at 2:26 AM, Rami Ylimäki <rami.ylimaki at vincit.fi> wrote:
> The test case was:
>  - thread 1 waits for events (XNextEvent)
>  - thread 2 executes a request (XSync) and finishes
>  - main thread lets thread 1 to finish by mapping a window that sends an
> expose event
>
> On 03/15/2011 03:19 AM, Jamey Sharp wrote:
>> This is clearly affecting quite a few people, and needs fixing.
>>
>> I think Robert White has a point in
>>        https://bugs.freedesktop.org/show_bug.cgi?id=29561
>> A function like xcb_check_for_buffered_events may be sufficient API in
>> XCB to let Xlib solve the deadlock bug, as well as that one, using
>> select or poll.

I take this back. Josh has convinced me that having Xlib use select or
poll will only work until some thread starts using XCB directly by way
of XGetXCBConnection. Since we want to encourage people to do that, no
approach involving select/poll is viable. Rami, that means your
patches for avoiding extra reads are our only proposal now, because a
fix for the event hang won't also reduce syscalls. I've reviewed those
patches below.

My new thought is to introduce

xcb_generic_event_t *xcb_wait_for_event_until(xcb_connection_t *c,
unsigned int request);

This function returns the next event with a sequence number less than
or equal to the given sequence number, or NULL if it can prove there
is no such event. It blocks until one of those conditions is true. I
don't think there's much reason to introduce a corresponding
xcb_poll_for_event_until, but perhaps someone will give us one
eventually.

I think _XReadEvents can use this instead of xcb_wait_for_event to
solve the _XReadEvents/_XReply hang. I may have the details wrong, but
the general idea is: If there's a PendingRequest queued, use the
sequence number from that; otherwise, call require_socket and then use
dpy->request, which is the last request sent. Since we might now get
NULL for reasons other than connection shutdown, don't call _XIOError
any more, and guard the later _XIOError call with an
"if(xcb_connection_has_error(c))". Also guard the
pending_requests->reply_waiter check by first verifying that
pending_requests is non-null, which would no longer be guaranteed.

The intuition behind this proposal is that the hang occurs when
_XReadEvents doesn't know that _XReply is expecting a reply, and
_XReply is stuck waiting for _XReadEvents to prove that any earlier
responses have been processed already. So we want _XReadEvents to wake
up by the time _XReply could proceed. Using dpy->request when there is
no PendingRequest is conservative; we know that any new PendingRequest
will be later than dpy->request, so we may wake up earlier than
necessary, but if so then we'll observe a new sequence number to wait
for, and go back to sleep. And that extra wakeup only affects
multi-threaded clients, because single-threaded clients can't issue
new requests while waiting for events.

>> Rami, I'd be interested in seeing the patches you wrote about having for
>> xcb_poll_for_queued_event.
>
> The patches that I have are meant to avoid excessive reading from the
> connection and I haven't attempted to fix the event/reply deadlock with
> them. I have made two separate fixes:
>
> 1) Fix unnecessary reading from connection in _XReply and _XReadEvents
> http://gitorious.org/meego-w40/libxcb/commit/b1c1755236958d834d2e924d585fc40fd8c7d289
> http://gitorious.org/meego-w40/libx11/commit/a1f5f909f9d98755519193b5e876c6aff679d040
> http://gitorious.org/meego-w40/libx11/commit/b8784c3010bfeb5d80389dc8b7c38795237ce162

These three patches look good to me, with some edits.

First, I'd squash the two libX11 patches together. Note that after the
second patch, poll_for_event is never used, and poll_for_response is
only used in _XEventsQueued. I think you can get rid of
poll_for_response too by inserting in _XEventsQueued:

if(!dpy->xcb->next_event)
    dpy->xcb->next_event = xcb_poll_for_event(dpy->xcb->connection);

Since that will read from the socket if there weren't any events
queued yet, afterward you can just call poll_for_queued_response; and
that means poll_for_next_* are always called with queued==True, so you
can simplify away the parameter and the extra functions.

This also means at most one read in _XEventsQueued, or possibly two in
QueuedAfterFlush mode, and if we're trying to reduce syscalls I guess
that's a good thing.

On the XCB patch, a minor style nit: libxcb doesn't use namespace
prefixes on static functions.

> 2) Clean up EAGAIN errors from read when connection is polled in
> _XEventsQueued
> http://gitorious.org/meego-w40/libxcb/commit/291ddb5232280ee0dbd677d6dca759ac86cbd827
> http://gitorious.org/meego-w40/libx11/commit/86d4a83cfd39765dd7c463d353aa22e468174eb4
> http://gitorious.org/meego-w40/libx11/commit/ef09c42c9c7effcf8f539606d2e7d41b44bc0eb7

So... instead of doing one syscall, now we'll do two or three for the
same effect, just to avoid having syscall traces show EAGAIN? These
patches seem like a cure that's worse than the disease. Unless you've
got some really good argument for this, I'm not inclined to merge
these three.

Jamey