Race condition with dbus_connection_send_with_reply()?

Thiago Macieira thiago at kde.org
Fri Mar 27 13:56:47 PDT 2009


Em Quarta-feira 04 Março 2009, às 08:33:08, Havoc Pennington escreveu:
> http://bugs.freedesktop.org/show_bug.cgi?id=857

> DBusPendingCall was created with a main loop in mind, the assumption
> being people would use dbus_connection_send_with_reply_and_block()
> with threads. But blocking with threads apparently has some issues
> also. Someone needs to figure it out. I would start by getting some
> decent test coverage on threaded dbus usage and/or running helgrind,
> and then also thinking conceptually how it should work of course.
>
> DBusPendingCall has some conceptual problems with threads because the
> notifier function would always be called from the
> connection-dispatching thread not the method-calling thread.
>
> In general I'm not sure it's sane to do much in threads besides block
> for replies, because dispatching can only happen in one thread, and
> dispatching and blocking are about the only two things the connection
> does anyway.

I think I've found another issue. It might be caused by the bug 857 above, but 
I am unsure. In any case, it's a real problem on its own.

Background: krunner is a heavily-threaded application that sends D-Bus 
blocking queries from any thread. It has a bug which would cause it to query 
HAL for each keypress. On top of that, Qt timers only exist in the thread that 
created the QObject that handles them.

	[aux thread (a thread that is not the connection thread)]
1) QDBusConnection calls dbus_connection_send_with_reply_and_block

2) that function calls dbus_connection_send_with_reply
 a) it calls _dbus_pending_call_new_unlocked
 b) new DBusPendingCall is returned with refcount = 1
 c) it calls _dbus_connection_attach_pending_call_unlocked
	i) this calls _dbus_connection_add_timeout_unlocked
		- QDBusConnection handles that by posting an event to itself, telling itself 
to add the timer
	ii) it calls _dbus_pending_call_ref_unlocked
	iii) the refcount = 2 now

3) dbus_connection_send_with_reply_and_block calls dbus_pending_call_block.

	[GUI thread (the thread where QDBusConnection runs)]
4) the event from #2.c.i is received and a timer is added

	[aux thread]
5) dbus_pending_call_block calls _dbus_connection_block_pending_call
 a) it calls dbus_pending_call_ref. refcount is now 3
 b) it calls check_for_reply_and_update_dispatch_unlocked
	i) if there is a reply, it calls complete_pending_call_and_unlock
		- that will ref up to 4
		- will call _dbus_connection_detach_pending_call_and_unlock
			which calls _dbus_pending_call_unref_and_unlock (ref = 3)
		- then it unrefs (down to 2)
	ii) so if there is a reply, the ref count is down by 1 (from 3 to 2)
 	iii) then it calls dbus_pending_call_unref (ref is now 1)
	iv) if there is no reply, it returns unmodified
 c) if it did return true, _dbus_connection_block_pending_call returns (ref=1)

 d) if it didn't return true, the refcount is still 3
 e) it calls _dbus_connection_do_iteration_unlocked
	i) this function may block for up to 25 seconds
 f) when it returns, we call _dbus_pending_call_get_completed_unlocked
 g) it is true, we call dbus_pending_call_unref (ref is now 2) and return

 h) if it's not true, we call check_for_reply_and_update_dispatch_unlocked
	i) same as above (#5.b)
 i) if it returned true, then same as above, we return with ref = 1
 j) if it didn't return true, the refcount is still 3
 k) if it didn't return true and we haven't timed out yet (and we haven't 
disconnected), then we basically go back to e with a smaller timeout
 l) if we have timed out, we call complete_pending_call_and_unlock
	i) same as #5.b.i above
	ii) so it downs the refcount to 2
 m) we call dbus_pending_call_unref (ref is now 1) and return

6) dbus_pending_call_block returns with ref = 1 or 2
 a) ref = 2 only happens in case #5.g
	that is, if _dbus_connection_block_pending_call iterated and then the reply 
	was already complete
 b) ref = 1 seems to be the most common case, including the timeout case

7) dbus_connection_send_with_reply_and_block calls dbus_pending_call_unref
 a) if ref was 1, this causes the DBusPendingCall object to be deleted

	[GUI thread]
8) the timer expires. QDBusConnection calls dbus_timeout_handle
 a) dbus_timeout_handle does:
	return (* timeout->handler) (timeout->handler_data);
     where the handler is reply_handler_timeout
	i) reply_handler_timeout reads from the already-deleted DBusPendingCall 
object (handler_data)

The valgrind log is attached (trimmed to relevant functions).

The race condition stems from the fact that most code paths end up calling 
_dbus_connection_detach_pending_call_and_unlock too soon. That is, when this 
function is called, it removes the timeout and then downs the reference count.

When it asks for the timeout to be removed, QtDBus function qDBusRemoveTimeout 
will realise it's running in the wrong thread. So instead of killing the Qt 
timer, it posts an event to itself asking the timer to be killed. Then it 
erases the hash entry that associated the Qt timer to the DBusTimeout 
structure.

The valgrind log is there showing that the timerEvent function was called, so 
the Qt timer wasn't yet killed. It also shows that the dbus_timeout_handle 
function was called. And that function is called with a locked mutex that 
protects the hash table associating the timers and DBusTimeout.

So I'm at an impossible situation: the valgrind log shows that 
dbus_connection_send_with_reply_and_block reached line 3319. Yet the timeout 
has not been removed, since QtDBus called dbus_timeout_handle (with a locked 
mutex).

So how can line 3319 have the last ref if the DBusTimeout has not been 
removed? In fact, how is it possible to reach this line with the DBusTimeout 
not removed?

The DBusTimeout (added in #2.c.i) is removed in #5.b.i. The #5.b sequence is 
repeated in #5.h; #5.b.i is repeated in #5.l.i. The only other exit point for 
dbus_pending_call_block is the one in #5.g, but that wouldn't down the 
reference count to 1.

The only place that sets completed to TRUE is _dbus_pending_call_complete, 
which in turn is only called by complete_pending_call_and_unlock. The problem 
is that complete_pending_call_and_unlock first calls 
_dbus_connection_detach_pending_call_and_unlock, which is supposed to remove 
the timeout from the connection.

So I'm back to my question: how is it possible that we can reach dbus-
connection.c:3319 with the DBusTimeout still in place?

I'm at loss to explain this.

-- 
Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
  Senior Product Manager - Nokia, Qt Software
      PGP/GPG: 0x6EF45358; fingerprint:
      E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind.log
Type: text/x-log
Size: 1365 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/dbus/attachments/20090327/ba272975/attachment.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part.
Url : http://lists.freedesktop.org/archives/dbus/attachments/20090327/ba272975/attachment.pgp 


More information about the dbus mailing list