Threading problems in (Win)DBus

Mon Jul 9 08:00:09 PDT 2007

2007/7/9, Olivier Hochreutiner <olivier.hochreutiner at gmail.com>:
> 2007/7/7, Havoc Pennington <hp at redhat.com>:
> > Hi,
> >
> >  From the traces, here two different threads take the lock at the same
> > time (apparently?):
> >
> > 10357:   LOCK: _dbus_connection_acquire_dispatch [3082640272]
> > 10357: 10357: lock socket_do_iteration post poll
> > 10357:   LOCK: _dbus_connection_lock [3082643152]
> >
> > And then here also:
> >
> > 11703:   LOCK: dbus_connection_send_with_reply [3082446544]
> > 11703: Allocated slot 0 on allocator 0xb7eb583c total 1 slots allocated
> > 1 used
> > 11703:   LOCK: dbus_connection_get_is_connected [3082443664]
> >
> > So that should not be possible, right? Unless the implementation of
> > locking is messed up somehow (which it may well be) or the locks are
> > no-ops since threads are for some reason not enabled.
>
> According to Christian's second post, threading was disabled in the
> traces above. Now that he has enabled it, he sees exactly the same
> behaviour under Linux that I see under Win32: when traces are enabled,
> no assert fails, and when traces are disabled, an assert fails after
> 10-100 iterations of the while loop in Paolo's example. The failing
> assert occurs in several different places, but it is always
> "connection->have_connection_lock" or "connection->io_path_acquired".
>
> I tried Peter's patch to test if threading is really enabled, and it
> is (in case you still doubted ;-)
>
> Also note that the bug seems much easier to reproduce (if not only
> reproducible) on fast CPUs. But that does not help much...
>
> All the facts we have so far make me think that memory corruption or
> a buffer overflow is involved in this problem, for the following
> reasons:

OK, the problem is memory corruption; I now have proof:

if I insert:

  char avoidoverflow[10000];

in the definition of "struct DBusConnection" anywhere between:

  unsigned int io_path_acquired : 1;

and

  unsigned int have_connection_lock : 1;

the problem disappears. I think this clearly means that
have_connection_lock (or sometimes io_path_acquired; see the asserts I
reported) is being overwritten by an overflow. Now the billion-dollar
question: where does the overflow (or underflow, or corrupted pointer)
occur?
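
To illustrate why the padding makes the symptom go away, here is a
minimal sketch. The structs are hypothetical stand-ins, not the real
DBusConnection layout, and "buf" is an invented field standing for
whatever is actually being overrun. The overrun is simulated with an
in-bounds memset so that the demo itself stays well-defined:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct DBusConnection: NOT the real layout.
 * "buf" is an invented field that some buggy code writes past. */
struct conn_plain {
    char buf[8];
    unsigned int have_connection_lock : 1;
};

/* Same struct with the padding from the experiment inserted. */
struct conn_padded {
    char buf[8];
    char avoidoverflow[10000];
    unsigned int have_connection_lock : 1;
};

/* Simulate an overrun: zero n bytes starting at byte offset "start"
 * inside the object, clamped to the object's size so the demo never
 * writes outside the struct. */
static void simulate_overrun(void *obj, size_t obj_size,
                             size_t start, size_t n)
{
    if (n > obj_size - start)
        n = obj_size - start;
    memset((unsigned char *)obj + start, 0, n);
}

/* Value of the flag after a 16-byte write aimed at the 8-byte buf,
 * with no padding between buf and the flag. */
int flag_after_overrun_plain(void)
{
    struct conn_plain c;
    memset(&c, 0, sizeof c);
    c.have_connection_lock = 1;
    simulate_overrun(&c, sizeof c, offsetof(struct conn_plain, buf), 16);
    return c.have_connection_lock;
}

/* Same write, but the padding absorbs the excess bytes. */
int flag_after_overrun_padded(void)
{
    struct conn_padded c;
    memset(&c, 0, sizeof c);
    c.have_connection_lock = 1;
    simulate_overrun(&c, sizeof c, offsetof(struct conn_padded, buf), 16);
    return c.have_connection_lock;
}
```

flag_after_overrun_plain() returns 0 (the flag was clobbered), while
flag_after_overrun_padded() returns 1. Note that the padding only hides
the stray write; it still happens, it just lands in avoidoverflow
instead of in the flag.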

Best,

Olivier