Announcing new version of the Qt4 bindings

Mon Feb 13 23:35:58 PST 2006

Hi, 

Thanks,

Lots of historical discussion on the deadlock issue, most recently:
http://lists.freedesktop.org/archives/dbus/2004-October/001665.html

I'll also include some notes from memory, I may misremember some
details...

On Mon, 2006-02-13 at 23:41 +0100, Thiago Macieira wrote:
> Well, I've found one: the old deadlock problem.
> 
> If a function, which has been called as a result of signal delivery or 
> reply delivery, tries to place a new blocking call, the whole application 
> deadlocks.
> 
> The reason for that is that it dbus_connection_dispatch tries to acquire 
> the dispatcher while it's already locked by the same thread.

I think there are two partially separable issues; one is the mutex
deadlock, since the dbus API doesn't have explicit locking I would say
any mutex deadlock is probably a bug. Two though is the bigger-picture
"old deadlock problem" as you say, where process A makes a blocking call
to process B, and process B then makes a blocking call to process A, and
both block on the socket waiting for the other to send a reply message.

Regarding the mutex deadlock, I have this in a Doxygen comment:

 * @todo FIXME what if we call out to application code to handle a
 * message, holding the dispatch lock, and the application code runs
 * the main loop and dispatches again? Probably deadlocks at the
 * moment. Maybe we want a dispatch status of DBUS_DISPATCH_IN_PROGRESS,
 * and then the GSource etc. could handle the situation? Right now
 * our GSource is NO_RECURSE

As this @todo hints the fix I'd like to see is that dispatch becomes a
no-op in this case, but it needs to return a special status so the main
loop can cope.

With GLib we are using this or at least used to be:
http://developer.gnome.org/doc/API/2.0/glib/glib-The-Main-Event-Loop.html#g-source-set-can-recurse

But this is a fragile thing to rely on, since people can access
DBusConnection directly without going through the main loop, so really
we need to trap the recursion inside DBusConnection itself.

I'm guessing this is not very hard but it does require an API change so
should probably be moved to the 1.0 blocker list in the TODO file.

I'm not sure how this works out with the Qt main loop so your advice is
welcome there.

btw there are other @todo items:
http://dbus.freedesktop.org/doc/api/html/todo.html
I have not recently scanned those for "1.0 blockers," most of them
aren't important but a couple may be.

Moving on to the larger issue of cross-process deadlock on the socket,
I think dbus is basically "policy free" on this and allows bindings to
choose from among the possible tradeoffs, though it could offer
additional support for a DCOP-style approach.

Some history, with ORBit all blocking calls simply ran the main loop
including dispatch of other incoming CORBA calls; my view is that this
was the proverbial Bad Idea, it's sort of like writing threaded code,
only without mutexes ;-) Nautilus (which heavily used ORBit) used to
crash all the time due to this - always in a difficult-to-reproduce way.

DCOP I gather has a slightly more clever approach, which is tracking an
ID for the current "call stack" and allowing reentrancy only within that
stack. This is more difficult with dbus because dbus calls/replies are
not synchronous and the "call stack id" would have to be passed around
in a bunch of places in the API, rather than kept implicitly as global
state. Waldo's message and surrounding thread I linked to above
discusses this some.

My opinion (which is not enforced by libdbus) is that most cures we've
collectively tried or thought of are worse than the disease. If we
simply disallow reentrancy while in a blocking call, any deadlock a)
should happen reliably and reproducibly and b) should be pretty easy to
diagnose with a debugger. The fix is also pretty simple, which is (worst
case) to convert one or the other call to an asynchronous call instead
of a blocking call. Also, the deadlock can be broken by killing either
process.

With dbus there are two ways to do an async call, one is to create a
thread for it and do a blocking call in the thread, and the other is
callback-based (you can send a call and register a callback to get the
reply). For the special case where you don't need to know the return
value or whether the call completed, you have a third option which is
"fire and forget" (for signals, this is the norm).

If someone wanted to experiment with the limited-reentrancy approach as
in DCOP, I think it would be interesting to get more knowledge on how it
would play out in the dbus API. Potentially this is not that big a deal
to add post-1.0 even, though I'm not sure without exploring further.

A bit more on the topic of "policy-free" - libdbus has convenience
functions that block, but they are all optional. i.e. you can use
libdbus without ever calling them - you could instead implement your own
blocking in terms of a main loop.

The idea of something like dbus_connection_send_with_reply_and_block()
is that you'd call it from a thread dedicated to blocking dbus calls,
use it in simple apps without a main loop, or use it in an app that
isn't likely to get into the deadlock scenario for whatever reason
(maybe it exports no interfaces, i.e. is a pure client). 

Anyway, for the GLib bindings we do use the libdbus blocking calls, to
avoid uncontrolled reentrancy; but I don't think there's anything about
libdbus that requires this approach - other than thread mutex bugs
perhaps ;-)

> Depending on how the call is handled by the application, the Qt bindings 
> will lose information and not reply 1:1.
> 
> If the idea is to just round-trip, 1:1 matching is done already (or so I 
> believe).

I may not be clear, what I mean by "round trip" is to demarshal to
native types and then marshal the same values again as dbus types.

There are a couple things this test would be designed to do:
 - be sure there's some way to marshal each dbus type from a given 
   binding; otherwise some interfaces would not be usable from the 
   binding
 - be sure the binding deals with all the tricky cases
   (zero-length arrays, structs with arrays in them, arrays of struct,
    etc.) - the test function I mentioned returns all the various 
   combinations of types

It would be legitimate to demarshal all sizes of dbus integer as a
single kind of integer though (for example), and then remarshaling would
not be possible without loss of information.

A couple fixes for that, still preserving the purpose of the test:
 - for the test suite only, keep lookaside/meta information on the 
   exact dbus type that was demarshaled, for use on remarshal
 - "fuzzy equality" for test success (allow an int16 that gets 
   remarshaled as int32 to pass)

Just some thoughts, nothing Qt-specific, this would be cool for the
other bindings too IMO.

Havoc