[Xcb] deadlock with xlib/xcb

Sun Oct 28 23:12:08 PDT 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Christoph and I spent a couple hours on IRC today and we believe we
fixed this deadlock.

The deadlock was caused by holding the xcb_xlib_lock while sleeping.
Here's one trace that fails.

Thread 1, with xlib.lock == 0, enters _xcb_conn_wait and calls
_xcb_unlock_io.

Thread 2 finishes _xcb_lock_io and sets xlib.lock = 1.

Thread 2 calls _xcb_conn_wait, but must block because Thread 1 is
already in select(). So Thread 2 calls pthread_cond_wait, waiting for
Thread 1 to finish, which releases the I/O lock. However, xlib.lock is
still 1.

Thread 1 finishes select() and enters _xcb_lock_io. It takes the I/O
lock successfully, but then notices that xlib.lock is 1 and xlib.thread
is not the current thread, so puts itself back to sleep, waiting for
Thread 2 to release the Xlib lock.

Now Thread 1 and Thread 2 are each waiting for each other and can't make
further progress.

Important note: whenever Xlib asks XCB to do I/O, Xlib must be prepared
for its locks to be dropped. This was true before and remains true after
this patch. I was going to claim that Xlib correctly handles the
potential dropped locks, but I just realized that at least the
process_responses function probably doesn't. This needs to be inspected.

The fix I implemented was to make pthread_cond_wait calls on the I/O
lock also drop and restore the xcb_xlib_lock. In the case where the Xlib
lock is held, the pseudocode for the new _xcb_wait_io function is this:

  xcb_xlib_unlock
  sleep
  xcb_xlib_lock

However, the final pthread_mutex_unlock in xcb_xlib_unlock needs to be
atomic with the sleep; and the initial pthread_mutex_lock and final
pthread_mutex_unlock of this sequence need to be skipped. For these
reasons, and because the xcb_xlib_lock and xcb_xlib_unlock symbols are
not visible inside libxcb, I inlined the needed parts into _xcb_wait_io.

In addition, when _xcb_conn_wait drops the I/O lock around the select()
call, it needs to drop and restore the xcb_xlib_lock there too.
Otherwise ico can trigger one of the double-unlock assertion failures
that have been driving folks nuts. ;-)

The above fix prevents XCB from following a "hold and wait" policy,
thereby breaking one of the necessary conditions for deadlock. Christoph
also proposed breaking another condition instead, namely "circular
wait". If, after the select() call, _xcb_conn_wait acquires the I/O lock
without waiting for the Xlib lock to be free, then that should also
solve the deadlock. Traditional Xlib actually works this way in its
equivalent code: it can acquire the Display lock even though other
threads have asked it not to, under some circumstances. However, we
concluded that this approach would be difficult to implement in XCB. We
didn't get so far as to discuss whether it would be a good idea. :-)

Christoph asked why we have this awful Xlib-specific lock inside XCB
anyway. In the X protocol, a request sent to the server does not contain
its sequence number--the client and server both count how many requests
have gone by, instead. So allocating a sequence number and allocating
output buffer space must happen together, atomically.

Xlib allocates its sequence numbers in a macro, GetReq (and variations),
which we can't change because its definition is published in XlibInt.h,
which makes it effectively public API for Xlib. That means that when
Xlib's "Display" structure is locked with LockDisplay, it must have the
currently valid sequence number in it, and that sequence number must
remain valid until UnlockDisplay.

So in Xlib/XCB, LockDisplay asks XCB what the current sequence number
is, and locks out XCB from allocating sequence numbers or buffer space
until UnlockDisplay. In other words, the xlib-xcb lock is to prevent XCB
from allocating new sequence numbers or output buffer space while Xlib
*might* be trying to do the same thing. This was the smallest API I
could come up with to make Xlib work.

I had another plan that eliminates all this complexity, but nobody would
let me get away with it. If Xlib is allowed to lie to its clients about
what sequence numbers were on the wire then this is a lot simpler. :-)
Xlib can allocate contiguous sequence numbers, and then internally keep
a mapping of what they actually were when they get sent later, and map
events and such back to the fake numbers as they return. That means that
Xlib-using code can free-run alongside XCB-using code without locks or
serialization, and that XCB doesn't need any special Xlib-specific API.
But keithp and Bart vetoed that plan. I think Keith was opposed to
letting Xlib lie about sequence numbers, and Bart says he's opposed to
adding the complexity of the mapping to Xlib.

Christoph asked if we could have simplified the problem using more
finely-grained locking, namely by locking only the sequence number
generation or something. Perhaps the I/O lock is too coarse? This may be
worth considering, but I haven't put enough thought into it yet.

christoph4: and then i'll soon be able to give you a locking medal
jameysharp: for me? awww.
christoph4: jep ;)
christoph4: the code is becoming complex enough to honor it that way ...

Jamey
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHJXlvp1aplQ4I9mURAkWBAJ4rbq3+25uM2tRYYSQ9xoK9AyynBwCghpKO
KjNEwibAaAF9lhKTKNkIiZU=
=fkyB
-----END PGP SIGNATURE-----