Race condition with dbus_connection_send_with_reply()?
Thiago Macieira
thiago at kde.org
Fri Mar 27 13:56:47 PDT 2009
Em Quarta-feira 04 Março 2009, às 08:33:08, Havoc Pennington escreveu:
> http://bugs.freedesktop.org/show_bug.cgi?id=857
> DBusPendingCall was created with a main loop in mind, the assumption
> being people would use dbus_connection_send_with_reply_and_block()
> with threads. But blocking with threads apparently has some issues
> also. Someone needs to figure it out. I would start by getting some
> decent test coverage on threaded dbus usage and/or running helgrind,
> and then also thinking conceptually how it should work of course.
>
> DBusPendingCall has some conceptual problems with threads because the
> notifier function would always be called from the
> connection-dispatching thread not the method-calling thread.
>
> In general I'm not sure it's sane to do much in threads besides block
> for replies, because dispatching can only happen in one thread, and
> dispatching and blocking are about the only two things the connection
> does anyway.
I think I've found another issue. It might be caused by the bug 857 above, but
I am unsure. In any case, it's a real problem on its own.
Background: krunner is a heavily-threaded application that sends D-Bus
blocking queries from any thread. It has a bug which would cause it to query
HAL for each keypress. On top of that, Qt timers only exist in the thread that
created the QObject that handles them.
[aux thread (a thread that is not the connection thread)]
1) QDBusConnection calls dbus_connection_send_with_reply_and_block
2) that function calls dbus_connection_send_with_reply
a) it calls _dbus_pending_call_new_unlocked
b) new DBusPendingCall is returned with refcount = 1
c) it calls _dbus_connection_attach_pending_call_unlocked
i) this calls _dbus_connection_add_timeout_unlocked
- QDBusConnection handles that by posting an event to itself, telling itself
to add the timer
ii) it calls _dbus_pending_call_ref_unlocked
iii) the refcount = 2 now
3) dbus_connection_send_with_reply_and_block calls dbus_pending_call_block.
[GUI thread (the thread where QDBusConnection runs)]
4) the event from #2.c.i is received and a timer is added
[aux thread]
5) dbus_pending_call_block calls _dbus_connection_block_pending_call
a) it calls dbus_pending_call_ref. refcount is now 3
b) it calls check_for_reply_and_update_dispatch_unlocked
i) if there is a reply, it calls complete_pending_call_and_unlock
- that will ref up to 4
- will call _dbus_connection_detach_pending_call_and_unlock
which calls _dbus_pending_call_unref_and_unlock (ref = 3)
- then it unrefs (down to 2)
ii) so if there is a reply, the ref count is down by 1 (from 3 to 2)
iii) then it calls dbus_pending_call_unref (ref is now 1)
iv) if there is no reply, it returns unmodified
c) if it did return true, _dbus_connection_block_pending_call returns (ref=1)
d) if it didn't return true, the refcount is still 3
e) it calls _dbus_connection_do_iteration_unlocked
i) this function may block for up to 25 seconds
f) when it returns, we call _dbus_pending_call_get_completed_unlocked
g) it is true, we call dbus_pending_call_unref (ref is now 2) and return
h) if it's not true, we call check_for_reply_and_update_dispatch_unlocked
i) same as above (#5.b)
i) if it returned true, then same as above, we return with ref = 1
j) if it didn't return true, the refcount is still 3
k) if it didn't return true and we haven't timed out yet (and we haven't
disconnected), then we basically go back to e with a smaller timeout
l) if we have timed out, we call complete_pending_call_and_unlock
i) same as #5.b.i above
ii) so it downs the refcount to 2
m) we call dbus_pending_call_unref (ref is now 1) and return
6) dbus_pending_call_block returns with ref = 1 or 2
a) ref = 2 only happens in case #5.g
that is, if _dbus_connection_block_pending_call iterated and then the reply
was already complete
b) ref = 1 seems to be the most common case, including the timeout case
7) dbus_connection_send_with_reply_and_block calls dbus_pending_call_unref
a) if ref was 1, this causes the DBusPendingCall object to be deleted
[GUI thread]
8) the timer expires. QDBusConnection calls dbus_timeout_handle
a) dbus_timeout_handle does:
return (* timeout->handler) (timeout->handler_data);
where the handler is reply_handler_timeout
i) reply_handler_timeout reads from the already-deleted DBusPendingCall
object (handler_data)
The valgrind log is attached (trimmed to relevant functions).
The race condition stems from the fact that most code paths end up calling
_dbus_connection_detach_pending_call_and_unlock too soon. That is, when this
function is called, it removes the timeout and then downs the reference count.
When it asks for the timeout to be removed, QtDBus function qDBusRemoveTimeout
will realise it's running in the wrong thread. So instead of killing the Qt
timer, it posts an event to itself asking the timer to be killed. Then it
erases the hash entry that associated the Qt timer to the DBusTimeout
structure.
The valgrind log is there showing that the timerEvent function was called, so
the Qt timer wasn't yet killed. It also shows that the dbus_timeout_handle
function was called. And that function is called with a locked mutex that
protects the hash table associating the timers and DBusTimeout.
So I'm at an impossible situation: the valgrind log shows that
dbus_connection_send_with_reply_and_block reached line 3319. Yet the timeout
has not been removed, since QtDBus called dbus_timeout_handle (with a locked
mutex).
So how can line 3319 have the last ref if the DBusTimeout has not been
removed? In fact, how is it possible to reach this line with the DBusTimeout
not removed?
The DBusTimeout (added in #2.c.i) is removed in #5.b.i. The #5.b sequence is
repeated in #5.h; #5.b.i is repeated in #5.l.i. The only other exit point for
dbus_pending_call_block is the one in #5.g, but that wouldn't down the
reference count to 1.
The only place that sets completed to TRUE is _dbus_pending_call_complete,
which in turn is only called by complete_pending_call_and_unlock. The problem
is that complete_pending_call_and_unlock first calls
_dbus_connection_detach_pending_call_and_unlock, which is supposed to remove
the timeout from the connection.
So I'm back to my question: how is it possible that we can reach dbus-
connection.c:3319 with the DBusTimeout still in place?
I'm at loss to explain this.
--
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
Senior Product Manager - Nokia, Qt Software
PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind.log
Type: text/x-log
Size: 1365 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/dbus/attachments/20090327/ba272975/attachment.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
Url : http://lists.freedesktop.org/archives/dbus/attachments/20090327/ba272975/attachment.pgp
More information about the dbus
mailing list