The deadlocks [was: Announcing new version of the Qt4 bindings]

Tue Feb 14 02:05:47 PST 2006

Havoc Pennington wrote:
>On Mon, 2006-02-13 at 23:41 +0100, Thiago Macieira wrote:
>> If a function, which has been called as a result of signal delivery or
>> reply delivery, tries to place a new blocking call, the whole
>> application deadlocks.
>>
>> The reason for that is that dbus_connection_dispatch tries to
>> acquire the dispatcher while it's already locked by the same thread.
>
>I think there are two partially separable issues; one is the mutex
>deadlock, since the dbus API doesn't have explicit locking I would say
>any mutex deadlock is probably a bug. Two though is the bigger-picture
>"old deadlock problem" as you say, where process A makes a blocking call
>to process B, and process B then makes a blocking call to process A, and
>both block on the socket waiting for the other to send a reply message.

Indeed, those are two separate issues. I realised that after going home
last night.

> * @todo FIXME what if we call out to application code to handle a
> * message, holding the dispatch lock, and the application code runs
> * the main loop and dispatches again? Probably deadlocks at the
> * moment. Maybe we want a dispatch status of DBUS_DISPATCH_IN_PROGRESS,
> * and then the GSource etc. could handle the situation? Right now
> * our GSource is NO_RECURSE
>
>As this @todo hints the fix I'd like to see is that dispatch becomes a
>no-op in this case, but it needs to return a special status so the main
>loop can cope.

I think I can solve this inside the binding without need for anything 
special inside libdbus. I'll let you know in a while if my idea worked.

Currently, the process stack goes like this:
event loop -> socket notifier -> dbus_connection_dispatch -> filter -> 
user function

If the user function recurses into another loop, it'll try to acquire the 
dispatcher again and deadlock (since the mutex is non-reentrant).
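
As a rough illustration of that failure mode (plain Python rather than
libdbus; dispatch_lock, dispatch and user_function are made-up names, and
the non-blocking acquire is there only so the sketch reports the problem
instead of hanging):

```python
import threading

# Hypothetical stand-in for the dispatcher lock; threading.Lock is
# non-reentrant, like the mutex described above.
dispatch_lock = threading.Lock()
log = []

def user_function():
    # A blocking call made from a handler ends up dispatching again.
    log.append(dispatch(nested=True))

def dispatch(nested=False):
    # Real code would block here forever. The sketch uses a non-blocking
    # acquire so that the failure is visible.
    if not dispatch_lock.acquire(blocking=False):
        return "would deadlock"
    try:
        if not nested:
            user_function()  # filter -> user function, lock still held
        return "dispatched"
    finally:
        dispatch_lock.release()

outer = dispatch()
```

The nested dispatch finds the lock already held by its own thread, which
is the point where the real application would hang.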

My solution will be to rely on the fact that dbus_connection_dispatch
never pops more than one message from the input queue per call. That way,
we'll have:

event loop -> socket notifier -> dbus_connection_dispatch -> filter

The filter will just prepare the call, but not execute it. When 
dbus_connection_dispatch exits, the socket notifier code will realise 
there's a pending call to be delivered and will deliver from there. This 
way, the DBus dispatcher will be unlocked for further calls.
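
A minimal sketch of that idea, assuming a toy Connection class in place of
DBusConnection (Connection, Binding, socket_notifier and the message
strings are all invented for illustration; this is not the binding's
code):

```python
import threading
from collections import deque

class Connection:
    """Hypothetical stand-in for a connection and its dispatcher lock."""
    def __init__(self):
        self.lock = threading.Lock()  # non-reentrant
        self.incoming = deque()
        self.filter = None

    def dispatch(self):
        # Pops at most one message per call, with the lock held.
        with self.lock:
            if self.incoming:
                self.filter(self.incoming.popleft())

class Binding:
    def __init__(self, conn, handler):
        self.conn = conn
        self.handler = handler
        self.pending = deque()
        conn.filter = self.filter

    def filter(self, message):
        # Runs with the dispatcher locked: only record the call.
        self.pending.append(message)

    def socket_notifier(self):
        self.conn.dispatch()
        # The dispatcher is unlocked again, so delivering now is safe
        # even if the handler dispatches recursively.
        while self.pending:
            self.handler(self.pending.popleft())

delivered = []
conn = Connection()

def handler(message):
    delivered.append(message)
    if message == "signal":
        # Simulate a blocking call made from the handler: a reply
        # arrives and the nested loop dispatches it.
        conn.incoming.append("reply")
        binding.socket_notifier()

binding = Binding(conn, handler)
conn.incoming.append("signal")
binding.socket_notifier()
```

Here the handler reenters the notifier while handling "signal" and still
receives "reply", because the lock is released before delivery.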

>But this is a fragile thing to rely on, since people can access
>DBusConnection directly without going through the main loop, so really
>we need to trap the recursion inside DBusConnection itself.

The way I understand it, all it would take is to call the user filter 
function with an unlocked dispatcher. This would prevent the reentrancy 
problem by allowing reentrancy.
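
Roughly, and only as a sketch of how I read it (toy Python, not libdbus
internals; lock, incoming and filter_func are hypothetical names), the
lock would cover the queue access but not the callback:

```python
import threading
from collections import deque

lock = threading.Lock()  # non-reentrant dispatcher lock (stand-in)
incoming = deque(["first", "second"])
handled = []

def dispatch(filter_func):
    # Hold the lock only while touching the queue.
    with lock:
        if not incoming:
            return False
        message = incoming.popleft()
    # The lock is released here, so the filter may dispatch again.
    filter_func(message)
    return True

def filter_func(message):
    handled.append(message)
    if message == "first":
        dispatch(filter_func)  # reentrant dispatch from user code

dispatch(filter_func)
```

Both messages get handled, the second one from inside the first one's
filter, without the lock ever being taken twice by the same thread.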

Maybe a solution in the style I'm preparing could make sense?

>I'm not sure how this works out with the Qt main loop so your advice is
>welcome there.

I've changed the code to reenter an event loop every time a blocking call
is made. I'm still evaluating whether that will have too big an impact.

>Moving on to the larger issue of cross-process deadlock on the socket,
>I think dbus is basically "policy free" on this and allows bindings to
>choose from among the possible tradeoffs, though it could offer
>additional support for a DCOP-style approach.
>
>Some history, with ORBit all blocking calls simply ran the main loop
>including dispatch of other incoming CORBA calls; my view is that this
>was the proverbial Bad Idea, it's sort of like writing threaded code,
>only without mutexes ;-) Nautilus (which heavily used ORBit) used to
>crash all the time due to this - always in a difficult-to-reproduce way.
>
>DCOP I gather has a slightly more clever approach, which is tracking an
>ID for the current "call stack" and allowing reentrancy only within that
>stack. This is more difficult with dbus because dbus calls/replies are
>not synchronous and the "call stack id" would have to be passed around
>in a bunch of places in the API, rather than kept implicitly as global
>state. Waldo's message and surrounding thread I linked to above
>discusses this some.

We'd need something like a central "context" ID to track in order to 
determine if a message we've received is related to the call we placed or 
not. This is what DCOP does, but this ID is generated by the server. The 
D-Bus method-call IDs are generated by the application.

We'll see how this plays out with my solution. If all goes as planned, 
every call will be able to recurse indefinitely. So if the application 
deadlocks, it's actually because the application can't handle it, not the 
binding or D-Bus.

So I'd have to make it very clear to the user that any call to D-Bus may 
cause recursions. And users should make it clear to other users that 
their function calls upon D-Bus, which means it can recurse. It's like 
exceptions in Java: it's part of the contract that this function throws 
that exception -- it's part of the contract that this function can 
recurse.

>If someone wanted to experiment with the limited-reentrancy approach as
>in DCOP, I think it would be interesting to get more knowledge on how it
>would play out in the dbus API. Potentially this is not that big a deal
>to add post-1.0 even, though I'm not sure without exploring further.

The DCOP style involves passing this "cookie" or "token" around, 
indicating which context this function is related to. This is to handle the 
case of A calling B, B calling C and then C calling back A.

If it were just A -> B -> A, it would be easy to detect in the library: if 
we're expecting a reply from B and B is making a call, it's got to be 
related. But if we get a call from C, how do we know it's related?

It would be necessary to get a token from the call from C indicating that 
it knew about our call. It could be something as simple as our base 
service + our method serial.
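
To make that concrete, here is a small sketch of the matching step. The
"context" field does not exist in D-Bus messages today; it, the Call
class, is_related and the service names below are assumptions made up for
the example:

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# Hypothetical token: (base service, method serial) of the original call.
Token = Tuple[str, int]

@dataclass
class Call:
    sender: str
    serial: int
    context: Optional[Token] = None  # assumed extra field, not in D-Bus

def is_related(incoming: Call, waiting_on: Set[Token]) -> bool:
    """True if the incoming call carries a token for a call we placed."""
    return incoming.context in waiting_on

# A (":1.1") places a blocking call with serial 7 to B. B calls C and C
# calls back into A, each passing the token along.
token = (":1.1", 7)
waiting_on = {token}

from_c = Call(sender=":1.3", serial=12, context=token)
unrelated = Call(sender=":1.4", serial=3)

related_result = is_related(from_c, waiting_on)
unrelated_result = is_related(unrelated, waiting_on)
```

With such a token, A can let the call from C through while it is blocked
on B, and keep the unrelated call queued.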

This would mean that either DBusConnection gets stateful -- this token is 
stored in the connection itself -- or it's on the API for new messages. 
The first solution is more or less what DCOP does. But I don't see it 
being implemented in D-Bus because the connection could be used from a 
separate thread (which is something that wasn't allowed in DCOP) to make 
a completely unrelated call. We can't block other threads either, because 
they could be making _related_ calls.

If it's in the API for new messages, it's an API change.

>Anyway, for the GLib bindings we do use the libdbus blocking calls, to
>avoid uncontrolled reentrancy; but I don't think there's anything about
>libdbus that requires this approach - other than thread mutex bugs
>perhaps ;-)

We'll see how uncontrolled reentrancy plays out.

[the rest of the message is replied to in another email]
-- 
Thiago José Macieira - thiago.macieira AT trolltech.com
Trolltech AS - Sandakerveien 116, NO-0402 Oslo, Norway