Performance: (was Re: [Accessibility] Re: [Accessibility-atspi] D-Bus AT-SPI - The way forward)

Thu Dec 13 09:15:20 PST 2007

Hi,

Rob Taylor wrote:
> The interesting thing here is even when doing no marshalling and with
> validation disabled, it's still at about 2x against ORBit. What other
> design decisions do you think might be influencing this?

I would say first, raw sockets are super fast. There just is not a lot 
of work the kernel is doing; when we write(), we switch into the kernel 
and I believe basically memcpy() into the read() buffer on the other 
side. I'm not a kernel developer, but the point is that the raw unix 
domain sockets case is doing very little CPU work.
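
As a rough illustration of that baseline, here is a minimal sketch of a blocking request/reply loop over a connected socket pair, with no message objects, queues, or marshaling in between. The helper names (recv_exact, round_trip, time_round_trips) and the 64-byte payload are assumptions made up for this example, and interpreter overhead means the timing says nothing about C code; it only shows how little there is in the raw case.

```python
import socket
import time

def recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may return fewer per recv()."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the socket")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)

def round_trip(client, server, payload):
    """One blocking 'call' and 'reply' with nothing but write()/read()."""
    client.sendall(payload)                  # request goes into the kernel
    request = recv_exact(server, len(payload))
    server.sendall(request)                  # echo it back as the reply
    return recv_exact(client, len(payload))

def time_round_trips(n=10000, payload=b"x" * 64):
    """Return average seconds per round trip over a connected socket pair."""
    client, server = socket.socketpair()     # AF_UNIX on Unix-like systems
    try:
        start = time.perf_counter()
        for _ in range(n):
            round_trip(client, server, payload)
        return (time.perf_counter() - start) / n
    finally:
        client.close()
        server.close()
```

Both ends live in one thread here, which works only because the payload fits comfortably in the socket buffer.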

Last time I collected actual data, there was not a clear single hotspot 
adding the overhead vs. raw sockets; it was just all the stuff we were 
doing to create and parse messages, queue them up, etc., each little bit 
of work adding to the total.

If you trace through what happens to send and receive dbus messages, 
there is quite a bit of code. We have the whole message queue 
abstraction with DBusConnection, the DBusPendingCall machinery, thread 
locking, parsing and marshaling messages, and so forth. None of this 
code is what I would call micro-optimized; there are lots of function 
calls, lots of abstractions, plenty of malloc().
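
To make the bookkeeping concrete, here is a toy sketch of a message queue plus pending-call table. ToyConnection and its dict-shaped messages are invented for this example and are not libdbus's API or wire format; the point is only that replies arriving out of order, interleaved with signals, force some queue and a serial-to-caller lookup on every message.

```python
from collections import deque

class ToyConnection:
    """Toy stand-in for queue + pending-call bookkeeping (hypothetical API)."""

    def __init__(self):
        self.next_serial = 1
        self.pending = {}        # serial -> callback waiting for its reply
        self.incoming = deque()  # messages read but not yet dispatched
        self.signals = []        # signal bodies handed to the application

    def call(self, on_reply):
        """Register a pending call and return the serial it was sent with."""
        serial = self.next_serial
        self.next_serial += 1
        self.pending[serial] = on_reply
        return serial

    def receive(self, message):
        """Queue a message as it comes off the socket."""
        self.incoming.append(message)

    def dispatch(self):
        """Drain the queue, routing replies by serial and collecting signals."""
        while self.incoming:
            msg = self.incoming.popleft()
            if msg["type"] == "reply":
                callback = self.pending.pop(msg["reply_serial"], None)
                if callback is not None:
                    callback(msg["body"])
            else:
                self.signals.append(msg["body"])
```

A protocol with strictly ordered replies could drop the pending table and hand each reply to the oldest outstanding call instead.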

I think it's not hard to believe that all this machinery around the 
message queue and the DBusMessage object is at least as much CPU work as 
the highly-optimized unix domain socket read()/write(). If we believe 
that, we'd expect dbus to be at least 2x slower than raw sockets.

Boiling it down to design decisions, some of them are libdbus-specific 
and some are protocol-specific. I don't know how ORBit works in enough 
detail to compare, so I'll just talk about dbus in an absolute sense.

To be clear, this is speculation... profiling is needed.

Example protocol features that I think are potentially slow:
  1) replies are not ordered, as they are in the X11 protocol; you can
     get replies in a different order than the calls. Also, signals can be
     interleaved with replies. The result is that a message queue
     is required to implement the protocol.
  2) strings are used instead of integer IDs for various things, such as
     object paths and interfaces
  3) each message has to be walked and validated/unpacked - you could
     imagine a design without the variable-length, tagged header fields,
     for example, so that header fields could be accessed at a fixed
     offset instead of having to build up an index of them
  4) the bus daemon, of course, doubles any round-trip time
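
Point 3 can be sketched with a deliberately simplified layout. The field codes and the (code, length, value) record format below are hypothetical and are not the real D-Bus header encoding; they only contrast walking tagged fields to build an index with reading one field at a known offset.

```python
import struct

# Hypothetical field codes for this sketch, not the real D-Bus codes.
FIELD_PATH, FIELD_MEMBER, FIELD_REPLY_SERIAL = 1, 2, 3

def pack_tagged(fields):
    """Encode {code: bytes} as repeated (code, length, value) records."""
    out = bytearray()
    for code, value in fields.items():
        out += struct.pack("<BB", code, len(value)) + value
    return bytes(out)

def index_tagged(buf):
    """Walk and bounds-check every record before any field can be looked up."""
    index, pos = {}, 0
    while pos < len(buf):
        if pos + 2 > len(buf):
            raise ValueError("truncated field header")
        code, length = struct.unpack_from("<BB", buf, pos)
        pos += 2
        if pos + length > len(buf):
            raise ValueError("truncated field value")
        index[code] = buf[pos:pos + length]
        pos += length
    return index

# Fixed layout for comparison: (serial, reply_serial) as two little-endian
# 32-bit integers, so the reply serial always sits at byte offset 4.
FIXED_HEADER = struct.Struct("<II")

def reply_serial_fixed(buf):
    """Read the reply serial directly, with no walk and no index."""
    return struct.unpack_from("<I", buf, 4)[0]
```

The tagged form is what buys optional and extensible header fields, so the fixed layout is a trade of flexibility for speed rather than a free win.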

Example libdbus decisions that I think are potentially slow:
  1) thread safety (both locks, and extra refcounting, etc.)
  2) validation (including security paranoia, e.g. use of DBusString)
  3) handling out-of-memory often results in elaborate machinery to
     create "transactions," especially in the bus daemon
  4) the main loop glue gunk adds overhead vs. either a single hardcoded
     loop or always blocking
  5) DBusMessage adds overhead; a blocking API could avoid creating that
     intermediate object, and marshal app data structures directly
     to/from sockets. This would break some of the flexibility of
     libdbus, of course, such as the ability for multiple handlers to
     get a look at a message.
  6) the "object tree" thing inside DBusConnection - processing a
     message involves parsing an object path and traversing the
     tree to find handlers
  7) the iterator approach to the DBusMessage public API, and the
     internal marshaling APIs, could be slower than an approach where
     the app had to provide all the data at once, perhaps even all the
     data at once in a single struct with known format
  8) abstraction layers; the support for multiple transports, multiple
     ways of doing things (blocking or not, etc.), different thread
     libraries, DBusString, all these layers add a bit more code
  9) the security and resource limits add some overhead to
     keep track of how many bytes of messages we have, etc.
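
For point 6, the per-message work looks roughly like the sketch below. ObjectTree and Node are names invented for this example, and the real tree inside DBusConnection does more (fallback handlers for subtrees, for instance, are left out); this only shows the split-and-descend step that every incoming call pays for.

```python
class Node:
    """One path component in the tree."""

    def __init__(self):
        self.children = {}   # component name -> Node
        self.handler = None  # handler registered exactly at this path

class ObjectTree:
    """Toy object-path tree (hypothetical, not libdbus's implementation)."""

    def __init__(self):
        self.root = Node()

    @staticmethod
    def _split(path):
        """Parse '/a/b/c' into ['a', 'b', 'c']; '/' becomes an empty list."""
        if not path.startswith("/"):
            raise ValueError("object path must start with '/'")
        return [part for part in path.split("/") if part]

    def register(self, path, handler):
        """Create any missing nodes along the path and attach the handler."""
        node = self.root
        for part in self._split(path):
            node = node.children.setdefault(part, Node())
        node.handler = handler

    def find_handler(self, path):
        """Parse the path, then do one lookup per component."""
        node = self.root
        for part in self._split(path):
            node = node.children.get(part)
            if node is None:
                return None
        return node.handler
```

With integer object IDs, the same dispatch would be a single table lookup with no string parsing, which is the contrast points 2 and 6 are getting at.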

Most of these things are small... my speculation is that roughly 
speaking, since the kernel can read/write from a unix domain socket so 
quickly, a bunch of small things pretty easily add up to a significant 
overhead relative to raw sockets.

If that speculation is right then it will be tough to get close to raw 
socket performance, since so many of the above decisions are embedded in 
the API or protocol.

Perhaps making each of these things a bit faster, though, would make a 
noticeable overall difference. Or maybe if we're lucky there are still a 
couple big "doh!" slownesses in there that can be killed off for a win.

Havoc