ignoring a process' last words: SIGPIPE ...

Michael Meeks michael at ximian.com
Thu Mar 27 05:15:38 PDT 2008


Hi guys,

	I've been debugging an interesting problem with NetworkManager on
SLED10 (which is dbus-0.6 - but, it's possible the problem persists in
HEAD) - and, it is really quite fun :-)

	I have a rather curious & slow machine, with a 100% repeatable failure
to get a DHCP address when you hot-plug the cable [ in some cases ].
Anyhow - here is the sequence:

        dhcdbd
		+ runs, gets the data from the network, all is
		  well etc.
		+ eventually it spawns (indirectly):

        dbus-send
		+ to send all the good information back to the 
		  mother-ship (network-manager)

	The essential problem is dbus-send fails to get it's data into the
daemon - and the reason (backed up by the strace info I append) is
attached but here are the highlights:

	dbus-send:
		+ starts up & registers itself with the daemon,
	daemon:
		+ the daemon starts a flurry of notifying the world
		  that we have a new guy in town: "NameOwnerChanged"
		  writes all over the shop.
		+ replies to the 'Hello' & marshals the return.
1.	dbus-send:
		+ writes its precious message, closes the fd, exits
		  the app.
2.	daemon:
		+ sends the 'NameAcquired' signal to the new dbus-send
***		+ SIGPIPE on writing to the fd.

	At which point the badness happens:

	The daemon (apparently) immediately closes the connection, oblivious to
the fact that there is critical data in it's socket buffer as-yet
un-read.

	That is not good.

	Of course on most machines, most of the time (for whatever reason), I
assume that 2. happens before 1. and thus everything goes well - but
this is by no means guaranteed etc. etc. [ and is perhaps dependent on
the number of listeners, the vagaries of the scheduler etc. ].

	Amusingly adding a 'sleep (1)' before closing the dbus-connection in
dbus-send reliably works around the problem.

	So having said all this - my two questions are:

	* is this a known / fixed bug in HEAD ?
		+ it appears dbus-transport-unix.c (do_writing)'s
			      if (bytes_written < 0)
				...
				do_io_error (transport);
		+ looks suspiciously similar to
		  dbus-transport-socket.c's do_writing

	* what is the right fix ?
		+ simply ignore EPIPE and report the hangup later in
		  the normal way as/when we poll.
			+ people shouldn't be using such a 'write' 
			  method to reliably catch async. process death
			  IMHO anyway :-)
		 + do a single-socket poll on EPIPE to check there
		  is no pending data to read.
			+ urgh
		+ something else ?

	I'm most interested in the answer. Clearly this bug affects all
short-lived processes that want to write a dying gasp to the bus :-) cf.
the appended for the longer story :-)

	Thanks,

		Michael.

dhcp client finished, and we fork dbus-send to return the good news,

12105 1206606513.515573 execve("/usr/bin/dbus-send", ["/usr/bin/dbus-send", "--system", "--dest=com.redhat.dhcp", "--type=method_call", "/com/redhat/dhcp/eth1", "com.redhat.dhcp.set", "string:old_subnet_mask=255.255.255.0\nold_domain_name_servers=192.168.0.1\nold_broadcast_address=192.168.0.255\nold_expiry=1206606324\nnew_subnet_mask=255.255.255.0\nnew_ip_address=192.168.0.2\nnew_network_number=192.168.0.0\ninterface=eth1\nreason=BOUND\nold_dhcp_message_type=5\nnew_expiry=1206610113\nnew_dhcp_lease_time=3600\npid=12093\nnew_dhcp_server_identifier=192.168.0.1\nnew_routers=192.168.0.1\nnew_domain_name_servers=192.168.0.1\nold_dhcp_server_identifier=192.168.0.1\nnew_dhcp_message_type=5\nold_ip_address=192.168.0.2\nold_dhcp_lease_time=3600\nnew_broadcast_address=192.168.0.255\nold_network_number=192.168.0.0\nold_routers=192.168.0.1"], [/* 28 vars */]) = 0

    ... dbus-send starts up ...

12105 1206606513.747566 poll([{fd=3, events=POLLOUT, revents=POLLOUT}], 1, -1) = 1
12105 1206606513.747628 write(3, "BEGIN\r\n", 7) = 7
12105 1206606513.747689 poll([{fd=3, events=POLLIN|POLLOUT, revents=POLLOUT}], 1, -1) = 1

      dbus-send says hello to the daemon:

12105 1206606513.747761 writev(3, [{"l\1\0\1\0\0\0\0\1\0\0\0n\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0", 128}, {"", 0}], 2) = 128
12105 1206606513.747972 gettimeofday({1206606513, 747986}, NULL) = 0
12105 1206606513.748025 poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, 25000) = 1

      ... dbus-send waits until that is done ...

      dbus-daemon meanwhile gets the message:

2758  1206606513.759122 poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=21, events=POLLIN}, {fd=9, events=POLLIN, revents=POLLIN}], 16, 20655) = 1
2758  1206606513.761413 gettimeofday({1206606513, 761462}, NULL) = 0
2758  1206606513.761546 read(9, "BEGIN\r\nl\1\0\1\0\0\0\0\1\0\0\0n\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\6\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\5\0\0\0Hello\0\0\0", 2048) = 135
2758  1206606513.761771 getuid32()      = 101

      dbus-daemon starts telling the world:

2758  1206606513.762076 access("/var/run/resmgr/classes/desktop/root", F_OK) = -1 ENOENT (No such file or directory)
2758  1206606513.762996 writev(21, [{"l\4\1\1\37\0\0\0\"\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\20\0\0\0NameOwnerChanged\0\0\0\0\0\0\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\10\1g\0\3sss\0\0\0\0\0\0\0\0", 160}, {"\6\0\0\0:1.258\0\0\0\0\0\0\0\0\0\0\6\0\0\0:1.258\0", 31}], 2) = 191
2758  1206606513.763593 writev(20, [{"l\4\1\1\37\0\0\0\"\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\20\0\0\0NameOwnerChanged\0\0\0\0\0\0\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\10\1g\0\3sss\0\0\0\0\0\0\0\0", 160}, {"\6\0\0\0:1.258\0\0\0\0\0\0\0\0\0\0\6\0\0\0:1.258\0", 31}], 2) = 191
2758  1206606513.764674 writev(18, [{"l\4\1\1\37\0\0\0\"\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\20\0\0\0NameOwnerChanged\0\0\0\0\0\0\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\10\1g\0\3sss\0\0\0\0\0\0\0\0", 160}, {"\6\0\0\0:1.258\0\0\0\0\0\0\0\0\0\0\6\0\0\0:1.258\0", 31}], 2) = 191
2758  1206606513.765228 writev(15, [{"l\4\1\1\37\0\0\0\"\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\20\0\0\0NameOwnerChanged\0\0\0\0\0\0\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\10\1g\0\3sss\0\0\0\0\0\0\0\0", 160}, {"\6\0\0\0:1.258\0\0\0\0\0\0\0\0\0\0\6\0\0\0:1.258\0", 31}], 2) = 191
2758  1206606513.765976 writev(12, [{"l\4\1\1\37\0\0\0\"\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\20\0\0\0NameOwnerChanged\0\0\0\0\0\0\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\10\1g\0\3sss\0\0\0\0\0\0\0\0", 160}, {"\6\0\0\0:1.258\0\0\0\0\0\0\0\0\0\0\6\0\0\0:1.258\0", 31}], 2) = 191
2758  1206606513.766239 writev(11, [{"l\4\1\1\37\0\0\0\"\0\0\0\211\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\20\0\0\0NameOwnerChanged\0\0\0\0\0\0\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\10\1g\0\3sss\0\0\0\0\0\0\0\0", 160}, {"\6\0\0\0:1.258\0\0\0\0\0\0\0\0\0\0\6\0\0\0:1.258\0", 31}], 2) = 191

      dbus-daemon sends reply to Hello:

2758  1206606513.766487 writev(9, [{"l\2\1\1\v\0\0\0\1\0\0\0=\0\0\0\6\1s\0\6\0\0\0:1.258\0\0\5\1u\0\1\0\0\0\10\1g\0\1s\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", 80}, {"\6\0\0\0:1.258\0", 11}], 2) = 91

      dbus-send resumes & very quickly (~1ms) sends its good news & exits:

12105 1206606513.766749 read(3, "l\2\1\1\v\0\0\0\1\0\0\0=\0\0\0\6\1s\0\6\0\0\0:1.258\0\0\5\1u\0\1\0\0\0\10\1g\0\1s\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\6\0\0\0:1.258\0", 2048) = 91
12105 1206606513.767016 read(3, 0x804d520, 2048) = -1 EAGAIN (Resource temporarily unavailable)
12105 1206606513.767313 writev(3, [{"l\1\0\1v\2\0\0\2\0\0\0g\0\0\0\1\1o\0\25\0\0\0/com/redhat/dhcp/eth1\0\0\0\2\1s\0\17\0\0\0com.redhat.dhcp\0\3\1s\0\3\0\0\0set\0\0\0\0\0\6\1s\0\17\0\0\0com.redhat.dhcp\0\10\1g\0\1s\0\0", 120}, {"q\2\0\0old_subnet_mask=255.255.255.0\nold_domain_name_servers=192.168.0.1\nold_broadcast_address=192.168.0.255\nold_expiry=1206606324\nnew_subnet_mask=255.255.255.0\nnew_ip_address=192.168.0.2\nnew_network_number=192.168.0.0\ninterface=eth1\nreason=BOUND\nold_dhcp_message_type=5\nnew_expiry=1206610113\nnew_dhcp_lease_time=3600\npid=12093\nnew_dhcp_server_identifier=192.168.0.1\nnew_routers=192.168.0.1\nnew_domain_name_servers=192.168.0.1\nold_dhcp_server_identifier=192.168.0.1\nnew_dhcp_message_type=5\nold_ip_address=192.168.0.2\nold_dhcp_lease_time=3600\nnew_broadcast_address=192.168.0.255\nold_network_number=192.168.0.0\nold_routers=192.168.0.1\0", 630}], 2) = 750
12105 1206606513.767802 close(3)        = 0
12105 1206606513.767926 exit_group(0)   = ?

      dbus-daemon resumes, writes it's NameAcquired message, gets a SIGPIPE:

2758  1206606513.771222 writev(9, [{"l\4\1\1\v\0\0\0\2\0\0\0\215\0\0\0\1\1o\0\25\0\0\0/org/freedesktop/DBus\0\0\0\2\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0\3\1s\0\f\0\0\0NameAcquired\0\0\0\0\6\1s\0\6\0\0\0:1.258\0\0\10\1g\0\1s\0\0\7\1s\0\24\0\0\0org.freedesktop.DBus\0\0\0\0", 160}, {"\6\0\0\0:1.258\0", 11}], 2) = -1 EPIPE (Broken pipe)
2758  1206606513.771468 --- SIGPIPE (Broken pipe) @ 0 (0) ---

      dbus-daemon over-reacts by closing the socket which still contains the in-flight
      message from above thus killing the pending data:

2758  1206606513.771689 close(9)        = 0

-- 
 michael.meeks at novell.com  <><, Pseudo Engineer, itinerant idiot

-------------- next part --------------
A non-text attachment was scrubbed...
Name: log-dhcdbc.txt.gz
Type: application/x-gzip
Size: 490851 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/dbus/attachments/20080327/70da4131/attachment-0006.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log-system-bus.txt.gz
Type: application/x-gzip
Size: 299310 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/dbus/attachments/20080327/70da4131/attachment-0007.bin 


More information about the dbus mailing list