[Xcb] [PATCH] fix deadlock with xcb_take_socket/return_socket v2

Wed May 15 01:47:38 PDT 2013

Hi Jamey,

well this misunderstanding you describe here is actually just one more 
problem with those two bugreports:

Bugreport https://bugs.freedesktop.org/show_bug.cgi?id=20708 and 
bugreport https://bugs.freedesktop.org/show_bug.cgi?id=30450 are about 
two completely different bugs!!! So setting bug 20708 as a duplicate of 
bug 30450 was actually incorrect.

The bug originally described in 20708 (and also what the original test 
program clearly exercise) is a deadlock because of acquiring locking 
primitives in different orders.

The bug described in 30450 seems to be some kind of race condition while 
reading events/replies from the input queue and indeed seems to be what 
you described in your mail.

The problem is that the test program(s) on bug 20708 can also trigger 
bug 30450, so while debugging this you guys triggered bug 30450 and 
thought it must be the same thing (I initially did the same mistake and 
it took me a week to figure that out correctly).

> libX11 carefully drops the Display lock around the xcb_generate_id calls
> that could result in XCB trying to write, and as far as I can tell it
> doesn't ask XCB to send requests any other time.

Yeah but nobody prevents other XCB users (like the OpenGL and/or VDPAU 
implementation in mesa) to be called with the display-lock held. 
Additional to that even libX11 calls xcb_take_socket with the 
displaylock held (which makes perfectly sense, cause otherwise another 
thread might directly steal the socket again), so in this code path you 
have displaylock->iolock, while in the return code path you have 
iolock->displaylock. The only correct solution I can see to this is to 
drop the iolock while calling the return code path.

> I'd love to see proposals for fixing this!

Yeah, I think that this one is even harder to fix.

Currently my testcase only triggers the deadlock fixed with this patch 
(pretty complicated stuff involving XBMC, VDPAU and mesas OpenGL 
implementation). But I'm pretty sure that once I get this one upstream 
we are going to run into the other one also.

Regards,
Christian.

Am 14.05.2013 21:45, schrieb Jamey Sharp:
> Sorry I didn't reply sooner, and thanks for looking into this!
>
> I'm not convinced you have the right diagnosis, though:
>
> If a thread owns the XCB socket, it should not ask XCB to write on that
> socket. The whole point of socket ownership is that XCB won't write
> without taking the socket back. (xcb_take_socket's documentation says
> this, but probably should be more clear.)
>
> libX11 carefully drops the Display lock around the xcb_generate_id calls
> that could result in XCB trying to write, and as far as I can tell it
> doesn't ask XCB to send requests any other time.
>
> Therefore, barring further evidence, I believe this note from commit
> 933aee1d5c53b0cc7d608011a29188b594c8d70b is still the correct
> explanation for https://bugs.freedesktop.org/show_bug.cgi?id=30450 .
>
> - If one thread is waiting for events and another thread tries to read a
>    reply, both will hang until an event arrives. Previously, if this
>    happened it might work sometimes, but otherwise would trigger either
>    an assertion failure or a permanent hang.
>
> That's purely a libX11 bug, but may need new API in XCB to maintain all
> the invariants. See the comment introduced in that commit for context:
>
> /* Thread-safety rules:
>   *
>   * At most one thread can be reading from XCB's event queue at a time.
>   * If you are not the current event-reading thread and you need to find
>   * out if an event is available, you must wait.
>   *
>   * The same rule applies for reading replies.
>   *
>   * A single thread cannot be both the the event-reading and the
>   * reply-reading thread at the same time.
>   *
>   * We always look at both the current event and the first pending reply
>   * to decide which to process next.
>   *
>   * We always process all responses in sequence-number order, which may
>   * mean waiting for another thread (either the event_waiter or the
>   * reply_waiter) to handle an earlier response before we can process or
>   * return a later one. If so, we wait on the corresponding condition
>   * variable for that thread to process the response and wake us up.
>   */
>
> And the actual bug is commented in that commit too:
>
> /* If some thread is already waiting for events,
>   * it will get the first one. That thread must
>   * process that event before we can continue. */
> /* FIXME: That event might be after this reply,
>   * and might never even come--or there might be
>   * multiple threads trying to get events. */
>
> My intuition is that new XCB API allowing callers to retrieve the next
> response, regardless of whether it's an event, error, or reply, might
> solve this bug. But I haven't had time to work out the details since
> writing that commit.
>
> I'd love to see proposals for fixing this!
>
> Jamey