[Xcb] deadlock with xlib/xcb

Thu Aug 9 13:53:50 PDT 2007

Hi,

The following hang was discovered by Darren Salt:

Thread 7 (process 7297):
#0  0x00002ad88209f756 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1  0x00002ad8847b699e in _xcb_conn_wait (c=0xf195f0, cond=0x43805df4,
    vector=0x0, count=0xffffffffffffffff) at xcb_conn.c:296
#2  0x00002ad8847b8405 in xcb_wait_for_reply (c=0xf195f0, request=623,
    e=0x43805e88) at xcb_in.c:344
#3  0x00002ad881540e7b in _XReply (dpy=0xf0b600, rep=0x43805ed0, extra=0,
    discard=1) at ../../src/xcb_io.c:364
#4  0x00002ad8815358da in XSync (dpy=0xf0b600, discard=0)
    at ../../src/Sync.c:48
<snip>

Thread 16 (process 7285):
#0  0x00002ad88209f756 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib/libpthread.so.0
#1  0x00002ad8847b684b in _xcb_lock_io (c=0xf195f0) at xcb_conn.c:279
#2  0x00002ad8847b69ac in _xcb_conn_wait (c=0xf195f0,
    cond=<value optimized out>, vector=0x0, count=0x0) at xcb_conn.c:320
#3  0x00002ad8847b8405 in xcb_wait_for_reply (c=0xf195f0, request=621,
    e=0x7fff2ac5b638) at xcb_in.c:344
#4  0x00002ad881540e7b in _XReply (dpy=0xf0b600, rep=0x7fff2ac5b680, extra=0,
    discard=1) at ../../src/xcb_io.c:364
#5  0x00002ad881536e84 in XTranslateCoordinates (dpy=0xf0b600,
    src_win=39845891, dest_win=77, src_x=0, src_y=0, dst_x=0x7fff2ac5b854,
    dst_y=0x7fff2ac5b850, child=0x7fff2ac5b848) at ../../src/TrCoords.c:53
<snip>

Concretely, the situation looks like this:

288 int _xcb_conn_wait(xcb_connection_t *c, pthread_cond_t *cond,
        struct iovec **vector, int *count)
289 {
290     int ret;
291     fd_set rfds, wfds;
292
293     /* If the thing I should be doing is already being done, wait for it. */
294     if(count ? c->out.writing : c->in.reading)
295     {
296         pthread_cond_wait(cond, &c->iolock); // <--- Thread 16
297         return 1;
298     }
299
300     FD_ZERO(&rfds);
301     FD_SET(c->fd, &rfds);
302     ++c->in.reading;
303
304     FD_ZERO(&wfds);
305     if(count)
306     {
307         FD_SET(c->fd, &wfds);
308         ++c->out.writing;
309     }
310
311     _xcb_unlock_io(c);
312     do {
313         ret = select(c->fd + 1, &rfds, &wfds, 0, 0);
314     } while (ret == -1 && errno == EINTR);
315     if (ret < 0)
316     {
317         _xcb_conn_shutdown(c);
318         ret = 0;
319     }
320     _xcb_lock_io(c); // <--- Thread 7

What happens: Thread 7 is running normally (c->xlib.lock == 0) and
blocks for a while in select() at line 313. Meanwhile thread 16 is
scheduled (c->xlib.lock == 1) and waits at line 296 for thread 7 to
complete its operation. When thread 7 reaches line 320 it can't take
the lock because c->xlib.lock == 1 and c->xlib.thread != pthread_self(),
so neither thread can make progress:

272 void _xcb_lock_io(xcb_connection_t *c)
273 {
274     pthread_mutex_lock(&c->iolock);
275     while(c->xlib.lock)
276     {
277         if(pthread_equal(c->xlib.thread, pthread_self()))
278             break;
279         pthread_cond_wait(&c->xlib.cond, &c->iolock);
280     }
281 }
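
To make the sequence easier to follow, here is a minimal single-threaded
sketch in C of the state described above. It is not xcb code: struct
model_conn, its fields and the three helper functions are hypothetical
names made up for illustration, and thread ids are plain ints so that
nothing actually blocks. It only mirrors the two wait conditions quoted
from xcb_conn.c (lines 294 and 275-278).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for xcb_connection_t: only the three
   fields the analysis depends on. Thread ids are plain ints here. */
struct model_conn {
    int xlib_lock;    /* stands in for c->xlib.lock */
    int xlib_thread;  /* stands in for c->xlib.thread */
    int reading;      /* stands in for c->in.reading */
};

/* True if the while loop in _xcb_lock_io would reach pthread_cond_wait
   (line 279) when called by thread `self`. */
static bool lock_io_would_wait(const struct model_conn *c, int self)
{
    assert(c != NULL);
    return c->xlib_lock && c->xlib_thread != self;
}

/* True if _xcb_conn_wait would wait at line 296 for a read
   (the count == NULL case of the test at line 294). */
static bool conn_wait_would_wait_for_read(const struct model_conn *c)
{
    assert(c != NULL);
    return c->reading != 0;
}

/* True if `selecting` (back from select, now at line 320) and `waiting`
   (parked at line 296) are each blocked on the other. */
static bool is_stuck(const struct model_conn *c, int selecting, int waiting)
{
    return lock_io_would_wait(c, selecting)   /* cannot reacquire at line 320 */
        && conn_wait_would_wait_for_read(c)   /* read still marked in progress */
        && c->xlib_thread == waiting;         /* and the waiter owns xlib.lock */
}
```

With xlib_lock = 1, xlib_thread = 16 and reading = 1, is_stuck(&c, 7, 16)
is true, which matches the hang; clearing xlib_lock makes it false,
because thread 7 could then get past line 320.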

So the next question was why this can happen at all. Let's take a look
at _XReply:

<snip>
355         /* Internals of UnlockDisplay done by hand here, so that we can
356            insert_pending_request *after* we _XPutXCBBuffer, but before we
357            unlock the display. */
358         _XPutXCBBuffer(dpy);
359         current = insert_pending_request(dpy);
360         if(!dpy->lock || dpy->lock->locking_level == 0)
361                 xcb_xlib_unlock(dpy->xcb->connection); // <--- XXX
362         if(dpy->xcb->lock_fns.unlock_display)
363                 dpy->xcb->lock_fns.unlock_display(dpy);
364         reply = xcb_wait_for_reply(c, current->sequence, &error);
365         LockDisplay(dpy);

Line 361 must have been executed in thread 7 (I can't verify this, but
it seems to be the only explanation), so c->xlib.lock became 0 before
xcb_wait_for_reply was called. However, thread 16 had
dpy->lock->locking_level == 1 (this time verified with gdb and a
core dump), so line 361 was skipped there and the xlib lock wasn't
released, which is part of the trouble.
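
The test at line 360 can be isolated as a small predicate. This is only
a sketch: struct model_lock_info and reply_releases_xlib_lock are
hypothetical names, and the struct models nothing but the locking_level
field read through dpy->lock. I am also assuming that locking_level is a
non-negative nesting count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the part of dpy->lock used at line 360. */
struct model_lock_info {
    int locking_level;  /* assumed to be a nesting count, never negative */
};

/* Mirrors: if(!dpy->lock || dpy->lock->locking_level == 0)
   Returns true when _XReply would call xcb_xlib_unlock (line 361). */
static bool reply_releases_xlib_lock(const struct model_lock_info *lock)
{
    assert(lock == NULL || lock->locking_level >= 0);
    return lock == NULL || lock->locking_level == 0;
}
```

With locking_level == 1, as seen in the core dump, the predicate is
false, so that thread enters xcb_wait_for_reply while xlib.lock is
still set.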

I have no idea where the actual bug is, but I see three possible
rules, any one of which would avoid this:
- xlib.lock always has to be released before calling xcb_wait_for_reply
- xlib.lock must never be released before calling xcb_wait_for_reply
- xcb has to deal with that situation internally

Hopefully you can follow my thoughts and have some nice ideas to fix this =)

Christoph

PS: Hints to reproduce the issue (note that I haven't tried this myself):

libx11-6 1.1.3-1, libxcb* 1.0-3 (Debian)
gxine dev, xine-lib 1.2 dev; gxine built --without-xcb
vdr 1.4.5, vdr-xine 0.7.9 dev (local builds)
Command: ./src/gxine vdr://tmp/vdr-xine/stream#demux:mpeg_pes
vdr tuned to BBC News 24 (which is 16:9)

http://zap.tartarus.org/~ds/gxine-0.5.900-dev.tar.bz2
http://zap.tartarus.org/~ds/xine-lib-1.1.90hg.tar.bz2