Performance: (was Re: [Accessibility] Re: [Accessibility-atspi] D-Bus AT-SPI - The way forward)
Havoc Pennington
hp at redhat.com
Thu Dec 13 09:15:20 PST 2007
Hi,
Rob Taylor wrote:
> The interesting thing here is even when doing no marshalling and with
> validation disabled, it's still at about 2x against ORBit. What other
> design decisions do you think might be influencing this?
I would say first, raw sockets are super fast. There just is not a lot
of work the kernel is doing; when we write(), we switch into the kernel
and I believe basically memcpy() into the read() buffer on the other
side. I'm not a kernel developer, but the point is that the raw unix
domain sockets case is doing very little CPU work.
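For anyone who wants a baseline number to compare against, here is a minimal sketch of a raw ping-pong over a unix domain socket pair. It is only an illustration: the function names (recv_exact, echo_once, average_round_trip) are made up for this example, both ends live in one process so there is no cross-process scheduling in the measurement, and Python adds its own interpreter overhead on top of the kernel work.

```python
import socket
import time

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the socket")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def echo_once(a, b, payload):
    """One round trip: a writes, b reads and echoes, a reads the echo."""
    a.sendall(payload)                      # copy into the peer's receive buffer
    request = recv_exact(b, len(payload))   # no parsing, no validation
    b.sendall(request)
    return recv_exact(a, len(payload))

def average_round_trip(count=1000, payload=b"x" * 64):
    """Average seconds per round trip over a connected socket pair."""
    # socketpair() defaults to AF_UNIX on platforms that define it.
    a, b = socket.socketpair()
    try:
        start = time.perf_counter()
        for _ in range(count):
            echo_once(a, b, payload)
        elapsed = time.perf_counter() - start
    finally:
        a.close()
        b.close()
    return elapsed / count
```

Calling average_round_trip() gives a floor to hold any higher-level messaging layer against; the absolute value will vary by machine and should not be read as a dbus or ORBit measurement.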
Last time I collected actual data, there was not a clear single hotspot
adding the overhead vs. raw sockets; it was just all the stuff we were
doing to create and parse messages, queue them up, etc., each little bit
of work adding to the total.
If you trace through what happens to send and receive dbus messages,
there is quite a bit of code. We have the whole message queue
abstraction with DBusConnection, the DBusPendingCall machinery, thread
locking, parsing and marshaling messages, and so forth. None of this
code is what I would call micro-optimized; there are lots of function
calls, lots of abstractions, plenty of malloc().
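To make the amount of bookkeeping concrete, here is a toy sketch of the kind of work a connection does per message: assign a serial, remember a pending call, and on receipt either complete that pending call or queue the message for later dispatch. ToyConnection and PendingCall are hypothetical names for this illustration only; this is not the libdbus API and it leaves out locking, timeouts and validation entirely.

```python
from collections import deque

class PendingCall:
    """Placeholder for a reply that has not arrived yet."""
    def __init__(self):
        self.reply = None
        self.done = False

class ToyConnection:
    def __init__(self):
        self._next_serial = 1
        self._pending = {}        # serial -> PendingCall awaiting a reply
        self.outgoing = deque()   # messages waiting to be written
        self.incoming = deque()   # messages no pending call claimed

    def send_call(self, body):
        """Queue a method call and return (serial, pending call)."""
        serial = self._next_serial
        self._next_serial += 1
        pending = PendingCall()
        self._pending[serial] = pending
        self.outgoing.append({"serial": serial, "body": body})
        return serial, pending

    def receive(self, message):
        """Route one incoming message to a pending call or the queue."""
        reply_to = message.get("reply_serial")
        pending = None
        if reply_to is not None:
            pending = self._pending.pop(reply_to, None)
        if pending is None:
            # Signals and unexpected messages wait for normal dispatch.
            self.incoming.append(message)
        else:
            pending.reply = message
            pending.done = True
```

Even this stripped-down version allocates an object and touches two containers per call, which is the sort of small cost that has no equivalent in a bare write()/read() pair.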
I think it's not hard to believe that all this machinery around the
message queue and the DBusMessage object is at least as much CPU work as
the highly-optimized unix domain socket read()/write(). If we believe
that, we'd expect dbus to be at least 2x slower than raw sockets.
Boiling it down to design decisions, some of them are libdbus-specific
and some are protocol-specific. I don't know how ORBit works in enough
detail to compare, so I'll just talk about dbus in an absolute sense.
To be clear, this is speculation... profiling is needed.
Example protocol features that I think are potentially slow:
1) replies are not ordered, as they are in the X11 protocol; you can get
replies in a different order than the calls. Also, signals can be
interleaved with replies. The result is that a message queue
is required to implement the protocol.
2) strings are used instead of integer IDs for various things, such as
object paths and interfaces
3) each message has to be walked and validated/unpacked - you could
imagine a design without the variable-length, tagged header fields,
for example, so that header fields could be accessed at a fixed
offset instead of having to build up an index of them
4) the bus daemon, of course, doubles any round-trip time
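To illustrate point 3, here is a sketch contrasting a header whose fields sit at fixed offsets with one made of variable-length tagged fields that must be walked to build an index. Both layouts are toy formats invented for this example, not the D-Bus wire format; the field codes (PATH, MEMBER), the integer member_id and the function names are all assumptions.

```python
import struct

# Toy fixed header: serial, reply_serial, member_id at known offsets.
FIXED = struct.Struct("<IIH")

def read_fixed(buf):
    """One unpack call; every field is at a known offset."""
    serial, reply_serial, member_id = FIXED.unpack_from(buf, 0)
    return {"serial": serial, "reply_serial": reply_serial,
            "member_id": member_id}

# Toy tagged header: repeated (code: u8, length: u8, value bytes).
PATH = 1     # illustrative field codes for this toy format
MEMBER = 3

def pack_tagged(fields):
    """Serialize a {code: bytes} mapping as tagged fields."""
    out = bytearray()
    for code, value in fields.items():
        out += struct.pack("<BB", code, len(value)) + value
    return bytes(out)

def index_tagged(buf):
    """Walk the whole header, bounds-checking each field, to build an index."""
    index = {}
    pos = 0
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated field header")
        code, length = struct.unpack_from("<BB", buf, pos)
        pos += 2
        if pos + length > len(buf):
            raise ValueError("truncated field value")
        index[code] = buf[pos:pos + length]
        pos += length
    return index
```

The tagged reader has to loop and check bounds per field before anything can be looked up, and the string values then still need comparing, whereas the fixed reader is a single unpack. That is the trade being described, flexibility against per-message work.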
Example libdbus decisions that I think are potentially slow:
1) thread safety (both locks, and extra refcounting, etc.)
2) validation (including security paranoia, e.g. use of DBusString)
3) handling out-of-memory often results in elaborate machinery to
create "transactions," especially in the bus daemon
4) the main loop glue gunk adds overhead vs. either a single hardcoded
loop or always blocking
5) DBusMessage adds overhead; a blocking API could avoid creating that
intermediate object, and marshal app data structures directly
to/from sockets. This would break some of the flexibility of
libdbus, of course, such as ability for multiple handlers to
get a look at a message.
6) the "object tree" thing inside DBusConnection - processing a
message involves parsing an object path and traversing the
tree to find handlers
7) the iterator approach to the DBusMessage public API, and the
internal marshaling APIs, could be slower than an approach where
the app had to provide all the data at once, perhaps even all the
data at once in a single struct with known format
8) abstraction layers; the support for multiple transports, multiple
ways of doing things (blocking or not, etc.), different thread
libraries, DBusString, all these layers add a bit more code
9) the security and resource limits add some overhead to
keep track of how many bytes of messages we have, etc.
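As a rough picture of point 6, here is a toy object tree: registering a handler splits the path into components and creates nodes, and dispatch splits the path again and walks down, collecting handlers along the way. ObjectTree, Node and find_handlers are hypothetical names, and the rule that ancestor handlers also get a look is an assumption made for the sketch rather than a description of what DBusConnection actually does.

```python
class Node:
    def __init__(self):
        self.children = {}   # path component -> Node
        self.handler = None

class ObjectTree:
    def __init__(self):
        self.root = Node()

    @staticmethod
    def _split(path):
        """Parse '/a/b/c' into ['a', 'b', 'c']; this runs per message."""
        if not path.startswith("/"):
            raise ValueError("object path must start with '/'")
        return [part for part in path.split("/") if part]

    def register(self, path, handler):
        node = self.root
        for part in self._split(path):
            node = node.children.setdefault(part, Node())
        node.handler = handler

    def find_handlers(self, path):
        """Return handlers on the way to path, most specific first."""
        found = []
        node = self.root
        if node.handler is not None:
            found.append(node.handler)
        for part in self._split(path):
            node = node.children.get(part)
            if node is None:
                break            # nothing registered below this point
            if node.handler is not None:
                found.append(node.handler)
        found.reverse()
        return found
```

Compare this with a protocol that used integer object IDs, where dispatch could be a single table lookup with no string splitting at all; that is the kind of per-message difference meant by points 2 and 6.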
Most of these things are small... my speculation is that roughly
speaking, since the kernel can read/write from a unix domain socket so
quickly, a bunch of small things pretty easily add up to a significant
overhead relative to raw sockets.
If that speculation is right then it will be tough to get close to raw
socket performance, since so many of the above decisions are embedded in
the API or protocol.
Perhaps making each of these things a bit faster, though, would make a
noticeable overall difference. Or maybe if we're lucky there are still a
couple big "doh!" slownesses in there that can be killed off for a win.
Havoc