[Xcb] deadlock with xlib/xcb

Thu Aug 9 14:01:14 PDT 2007

Wow.  Thanks hugely for the detailed and clear bug report
and analysis!

	Bart

In message <19a3b7a80708091353x501757bsc132555156c62744 at mail.gmail.com> you wrote:
> Hi,
> 
> The following hang was discovered by Darren Salt:
> 
> Thread 7 (process 7297):
> #0  0x00002ad88209f756 in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib/libpthread.so.0
> #1  0x00002ad8847b699e in _xcb_conn_wait (c=0xf195f0, cond=0x43805df4,
>     vector=0x0, count=0xffffffffffffffff) at xcb_conn.c:296
> #2  0x00002ad8847b8405 in xcb_wait_for_reply (c=0xf195f0, request=623,
>     e=0x43805e88) at xcb_in.c:344
> #3  0x00002ad881540e7b in _XReply (dpy=0xf0b600, rep=0x43805ed0, extra=0,
>     discard=1) at ../../src/xcb_io.c:364
> #4  0x00002ad8815358da in XSync (dpy=0xf0b600, discard=0)
>     at ../../src/Sync.c:48
> <snip>
> 
> Thread 16 (process 7285):
> #0  0x00002ad88209f756 in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib/libpthread.so.0
> #1  0x00002ad8847b684b in _xcb_lock_io (c=0xf195f0) at xcb_conn.c:279
> #2  0x00002ad8847b69ac in _xcb_conn_wait (c=0xf195f0,
>     cond=<value optimized out>, vector=0x0, count=0x0) at xcb_conn.c:320
> #3  0x00002ad8847b8405 in xcb_wait_for_reply (c=0xf195f0, request=621,
>     e=0x7fff2ac5b638) at xcb_in.c:344
> #4  0x00002ad881540e7b in _XReply (dpy=0xf0b600, rep=0x7fff2ac5b680, extra=0,
>     discard=1) at ../../src/xcb_io.c:364
> #5  0x00002ad881536e84 in XTranslateCoordinates (dpy=0xf0b600,
>     src_win=39845891, dest_win=77, src_x=0, src_y=0, dst_x=0x7fff2ac5b854,
>     dst_y=0x7fff2ac5b850, child=0x7fff2ac5b848) at ../../src/TrCoords.c:53
> <snip>
> 
> Concretely, the situation looks like this:
> 
> 288 int _xcb_conn_wait(xcb_connection_t *c, pthread_cond_t *cond,
>                        struct iovec **vector, int *count)
> 289 {
> 290     int ret;
> 291     fd_set rfds, wfds;
> 292
> 293     /* If the thing I should be doing is already being done, wait for it. */
> 294     if(count ? c->out.writing : c->in.reading)
> 295     {
> 296         pthread_cond_wait(cond, &c->iolock); // <--- Thread 16
> 297         return 1;
> 298     }
> 299
> 300     FD_ZERO(&rfds);
> 301     FD_SET(c->fd, &rfds);
> 302     ++c->in.reading;
> 303
> 304     FD_ZERO(&wfds);
> 305     if(count)
> 306     {
> 307         FD_SET(c->fd, &wfds);
> 308         ++c->out.writing;
> 309     }
> 310
> 311     _xcb_unlock_io(c);
> 312     do {
> 313         ret = select(c->fd + 1, &rfds, &wfds, 0, 0);
> 314     } while (ret == -1 && errno == EINTR);
> 315     if (ret < 0)
> 316     {
> 317         _xcb_conn_shutdown(c);
> 318         ret = 0;
> 319     }
> 320     _xcb_lock_io(c); // <--- Thread 7
> 
> What happens: Thread 7 is running normally (c->xlib.lock == 0) and
> waits a bit at line 313. Meanwhile thread 16 is scheduled
> (c->xlib.lock == 1) and waits at line 296 for thread 7 to complete its
> operation. When thread 7 reaches line 320 it can't take the lock
> because c->xlib.lock == 1 and c->xlib.thread != pthread_self() ...
> 
> 272 void _xcb_lock_io(xcb_connection_t *c)
> 273 {
> 274     pthread_mutex_lock(&c->iolock);
> 275     while(c->xlib.lock)
> 276     {
> 277         if(pthread_equal(c->xlib.thread, pthread_self()))
> 278             break;
> 279         pthread_cond_wait(&c->xlib.cond, &c->iolock);
> 280     }
> 281 }
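
The circular wait described above can be modelled without real threads. What follows is only a sketch: struct conn_model, the integer thread ids, and all three function names are hypothetical stand-ins invented for illustration, not part of XCB. It encodes the two tests from the listings (line 294 for a reader, and the loop at lines 275-280) and reports when each side is waiting on the other.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the xcb_connection_t fields
   discussed above. Thread identities are plain ints so the model is
   deterministic. */
struct conn_model {
    int  in_reading;   /* models c->in.reading  */
    bool xlib_lock;    /* models c->xlib.lock   */
    int  xlib_owner;   /* models c->xlib.thread */
};

/* Mirrors the loop in _xcb_lock_io: a caller gets through only if the
   xlib lock is clear or the caller itself owns it. */
static bool lock_io_would_block(const struct conn_model *c, int self)
{
    return c->xlib_lock && c->xlib_owner != self;
}

/* Mirrors the test at line 294 for a reader (count == NULL): if some
   other thread is already reading, the caller sleeps at line 296. */
static bool conn_wait_would_sleep(const struct conn_model *c)
{
    return c->in_reading > 0;
}

/* The reported hang: `reader` has finished select() and is trying to
   retake the I/O lock at line 320, while `owner` holds the xlib lock
   and sleeps at line 296 until the reader is done. */
static bool is_deadlocked(const struct conn_model *c, int reader, int owner)
{
    assert(reader != owner);
    return c->xlib_owner == owner
        && lock_io_would_block(c, reader)
        && conn_wait_would_sleep(c);
}
```

With in_reading == 1 and the xlib lock held by the other thread, is_deadlocked() returns true; clearing xlib_lock in the model makes it return false, which matches the observation that the hang needs the "lock" to stay set while its owner sleeps.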
> 
> So the next question was why this can happen at all. Let's take a look
> at _XReply:
> 
> <snip>
> 355         /* Internals of UnlockDisplay done by hand here, so that we can
> 356            insert_pending_request *after* we _XPutXCBBuffer, but before we
> 357            unlock the display. */
> 358         _XPutXCBBuffer(dpy);
> 359         current = insert_pending_request(dpy);
> 360         if(!dpy->lock || dpy->lock->locking_level == 0)
> 361                 xcb_xlib_unlock(dpy->xcb->connection); // <--- XXX
> 362         if(dpy->xcb->lock_fns.unlock_display)
> 363                 dpy->xcb->lock_fns.unlock_display(dpy);
> 364         reply = xcb_wait_for_reply(c, current->sequence, &error);
> 365         LockDisplay(dpy);
> 
> Line 361 must have been executed in thread 7 (I could not check this,
> but it seems to be the only explanation), so c->xlib.lock became 0
> before xcb_wait_for_reply was called. However, thread 16 had
> dpy->lock->locking_level == 1 (this time verified with gdb and a
> core dump), so its "lock" was not released, which caused part of the
> trouble.
> 
> I have no idea where the actual bug is, but I see roughly three
> possible conditions, any of which would avoid this:
> - xlib.lock has to be released before calling xcb_wait_for_reply
> - xlib.lock must not be released before calling xcb_wait_for_reply
> - xcb has to deal with that situation internally
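
As an illustration of the third option only: one conceivable way for XCB to handle this internally is to give up the xlib lock for the duration of the sleep at line 296 and take it back afterwards. The sketch below is an assumption made for this example, not a statement of how XCB does or should fix the bug; struct conn, lock_io, wait_io, reader_main and run_demo are all hypothetical names, and the reader thread merely pretends to have returned from select().

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical miniature of the connection state discussed above. */
struct conn {
    pthread_mutex_t iolock;
    pthread_cond_t  xlib_cond;   /* models c->xlib.cond */
    pthread_cond_t  read_cond;   /* the `cond` handed to _xcb_conn_wait */
    bool            xlib_lock;   /* models c->xlib.lock */
    pthread_t       xlib_thread; /* models c->xlib.thread */
    int             reading;     /* models c->in.reading */
};

/* Same logic as the quoted _xcb_lock_io. */
static void lock_io(struct conn *c)
{
    pthread_mutex_lock(&c->iolock);
    while (c->xlib_lock && !pthread_equal(c->xlib_thread, pthread_self()))
        pthread_cond_wait(&c->xlib_cond, &c->iolock);
}

/* Assumed helper: drop the xlib lock while sleeping so that lock_io()
   in another thread can get through, then take it back on wake-up.
   Must be called with iolock held. */
static void wait_io(struct conn *c, pthread_cond_t *cond)
{
    bool had_xlib = c->xlib_lock &&
                    pthread_equal(c->xlib_thread, pthread_self());
    if (had_xlib) {
        c->xlib_lock = false;
        pthread_cond_broadcast(&c->xlib_cond);
    }
    pthread_cond_wait(cond, &c->iolock);
    if (had_xlib) {
        while (c->xlib_lock)
            pthread_cond_wait(&c->xlib_cond, &c->iolock);
        c->xlib_lock = true;
        c->xlib_thread = pthread_self();
    }
}

/* Plays the thread that has finished reading and re-locks (line 320). */
static void *reader_main(void *arg)
{
    struct conn *c = arg;
    lock_io(c);
    c->reading = 0;
    pthread_cond_broadcast(&c->read_cond);
    pthread_mutex_unlock(&c->iolock);
    return NULL;
}

/* Plays the xlib-lock owner waiting for the reader (line 296).
   Returns 0 once both threads have finished, -1 if no thread
   could be created. */
static int run_demo(void)
{
    struct conn c;
    pthread_t reader;

    pthread_mutex_init(&c.iolock, NULL);
    pthread_cond_init(&c.xlib_cond, NULL);
    pthread_cond_init(&c.read_cond, NULL);
    c.reading = 1;                 /* the reader is "inside select()" */
    c.xlib_lock = true;
    c.xlib_thread = pthread_self();

    pthread_mutex_lock(&c.iolock);
    if (pthread_create(&reader, NULL, reader_main, &c) != 0) {
        pthread_mutex_unlock(&c.iolock);
        return -1;
    }
    while (c.reading)              /* loop guards against spurious wake-ups */
        wait_io(&c, &c.read_cond);
    c.xlib_lock = false;
    pthread_cond_broadcast(&c.xlib_cond);
    pthread_mutex_unlock(&c.iolock);

    pthread_join(reader, NULL);
    assert(c.reading == 0);
    pthread_cond_destroy(&c.read_cond);
    pthread_cond_destroy(&c.xlib_cond);
    pthread_mutex_destroy(&c.iolock);
    return 0;
}
```

Calling run_demo() from a main() should return 0. If wait_io() is replaced by a bare pthread_cond_wait(), the reader blocks in lock_io() on xlib_cond while the owner sleeps on read_cond, which is the same shape as the hang in the two backtraces.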
> 
> Hopefully you can follow my thoughts and have some nice ideas to fix this =)
> 
> Christoph
> 
> 
> PS: Hints to reproduce the issue (note that I didn't try personally):
> 
> libx11-6 1.1.3-1, libxcb* 1.0-3 (Debian)
> gxine dev, xine-lib 1.2 dev; gxine built --without-xcb
> vdr 1.4.5, vdr-xine 0.7.9 dev (local builds)
> Command: ./src/gxine vdr://tmp/vdr-xine/stream#demux:mpeg_pes
> vdr tuned to BBC News 24 (which is 16:9)
> 
> http://zap.tartarus.org/~ds/gxine-0.5.900-dev.tar.bz2
> http://zap.tartarus.org/~ds/xine-lib-1.1.90hg.tar.bz2