[Xcb] Xlib/XCB in multi-threaded situation results in deadlock

Fri Mar 18 08:08:59 PDT 2011

On 03/17/2011 11:25 PM, Jamey Sharp wrote:
> I take this back. Josh has convinced me that having Xlib use select or
> poll will only work until some thread starts using XCB directly by way
> of XGetXCBConnection. Since we want to encourage people to do that, no
> approach involving select/poll is viable. Rami, that means your
> patches for avoiding extra reads are our only proposal now, because a
> fix for the event hang won't also reduce syscalls. I've reviewed those
> patches below.

Ok.

> My new thought is to introduce
>
> xcb_generic_event_t *xcb_wait_for_event_until(xcb_connection_t *c,
> unsigned int request);
>
> This function returns the next event with a sequence number less than
> or equal to the given sequence number, or NULL if it can prove there
> is no such event. It blocks until one of those conditions is true. I
> don't think there's much reason to introduce a corresponding
> xcb_poll_for_event_until, but perhaps someone will give us one
> eventually.
>
> I think _XReadEvents can use this instead of xcb_wait_for_event to
> solve the _XReadEvents/_XReply hang. I may have the details wrong, but
> the general idea is: If there's a PendingRequest queued, use the
> sequence number from that; otherwise, call require_socket and then use
> dpy->request, which is the last request sent. Since we might now get
> NULL for reasons other than connection shutdown, don't call _XIOError
> any more, and guard the later _XIOError call with an
> "if(xcb_connection_has_error(c))". Also guard the
> pending_requests->reply_waiter check by first verifying that
> pending_requests is non-null, which would no longer be guaranteed.
>
> The intuition behind this proposal is that the hang occurs when
> _XReadEvents doesn't know that _XReply is expecting a reply, and
> _XReply is stuck waiting for _XReadEvents to prove that any earlier
> responses have been processed already. So we want _XReadEvents to wake
> up by the time _XReply could proceed. Using dpy->request when there is
> no PendingRequest is conservative; we know that any new PendingRequest
> will be later than dpy->request, so we may wake up earlier than
> necessary, but if so then we'll observe a new sequence number to wait
> for, and go back to sleep. And that extra wakeup only affects
> multi-threaded clients, because single-threaded clients can't issue
> new requests while waiting for events.

Makes sense to me. To sum up, my original patches tried to determine 
that there is no need to wait for events in _XReply at the expence of 
exposing sequence numbers from Xcb. Your approach is definitely cleaner 
and solves the problem at the expence of waking up at _XReadEvents when 
some sequence number is seen and letting _XReply to continue. I don't 
see why that wouldn't work nicely.

>>> Rami, I'd be interested in seeing the patches you wrote about having for
>>> xcb_poll_for_queued_event.
>> The patches that I have are meant to avoid excessive reading from the
>> connection and I haven't attempted to fix the event/reply deadlock with
>> them. I have made two separate fixes:
>>
>> 1) Fix unnecessary reading from connection in _XReply and _XReadEvents
>> http://gitorious.org/meego-w40/libxcb/commit/b1c1755236958d834d2e924d585fc40fd8c7d289
>> http://gitorious.org/meego-w40/libx11/commit/a1f5f909f9d98755519193b5e876c6aff679d040
>> http://gitorious.org/meego-w40/libx11/commit/b8784c3010bfeb5d80389dc8b7c38795237ce162
> These three patches look good to me, with some edits.
>
> First, I'd squash the two libX11 patches together. Note that after the
> second patch, poll_for_event is never used, and poll_for_response is
> only used in _XEventsQueued. I think you can get rid of
> poll_for_response too by inserting in _XEventsQueued:
>
> if(!dpy->xcb->next_event)
>      dpy->xcb->next_event = xcb_poll_for_event(dpy->xcb->connection);

I see, we can pretty much just extract the _XEventsQueued specific 
functionality out of poll_for_event.

> Since that will read from the socket if there weren't any events
> queued yet, afterward you can just call poll_for_queued_response; and
> that means poll_for_next_* are always called with queued==True, so you
> can simplify away the parameter and the extra functions.
>
> This also means at most one read in _XEventsQueued, or possibly two in
> QueuedAfterFlush mode, and if we're trying to reduce syscalls I guess
> that's a good thing.
>
> On the XCB patch, a minor style nit: libxcb doesn't use namespace
> prefixes on static functions.
>

Thanks for the comments, I'll prepare new versions of those patches.

>> 2) Clean up EAGAIN errors from read when connection is polled in
>> _XEventsQueued
>> http://gitorious.org/meego-w40/libxcb/commit/291ddb5232280ee0dbd677d6dca759ac86cbd827
>> http://gitorious.org/meego-w40/libx11/commit/86d4a83cfd39765dd7c463d353aa22e468174eb4
>> http://gitorious.org/meego-w40/libx11/commit/ef09c42c9c7effcf8f539606d2e7d41b44bc0eb7
> So... instead of doing one syscall, now we'll do two or three for the
> same effect, just to avoid having syscall traces show EAGAIN? These
> patches seem like a cure that's worse than the disease. Unless you've
> got some really good argument for this, I'm not inclined to merge
> these three.
>

I never expected those patches to be accepted upstream and I'm not going 
to defend them :). They were done just for strace cleanup so that system 
analysis would be easier.

-- Rami