hp at redhat.com
Sat Nov 27 00:29:42 PST 2004
I've been doing some optimization. For the curious, here's where
performance currently stands. Not fantastic yet. Build with
--disable-asserts --disable-verbose-mode if you are planning to
profile. You could also --disable-checks but that's a little bogus
since we'd normally expect to enable checks in the installed libdbus.
Fortunately --disable-checks offers no measurable performance gain in
my tests.
To run the profile I'm using:
PROFILE_TYPE=all DBUS_TOP_BUILDDIR=/home/hp/dbus-cvs/dbus ./run-test.sh profile
The profiler is set up to baseline vs. plain sockets rather than only
giving times, since the raw times depend on the machine.
Results from my Intel(R) Pentium(R) M processor 1300MHz are:
Baseline plain sockets time 22.1879 seconds for 750000 iterations
1 times slower for plain sockets (22.1879 seconds, 0.000030 per iteration)
1.03088 times slower for plain sockets with malloc overhead (22.873 seconds, 0.000030 per iteration)
2.42119 times slower for dbus direct without bus (53.7211 seconds, 0.000072 per iteration)
5.22905 times slower for routing via a bus (116.022 seconds, 0.000155 per iteration)
Here are results using the November 11 source tree:
3.2281 times slower for dbus direct without bus (76.5563 seconds, 0.000102 per iteration)
6.82132 times slower for routing via a bus (161.771 seconds, 0.000216 per iteration)
The profile iteration is a round trip: one method call and the reply,
synchronously. The four cases are:

 plain sockets
    exchange data packets with read/write directly, no
    parsing, allocation, or even buffering
 plain sockets with malloc
    buffer the data that's read/written on the heap
 dbus direct without bus
    two threads talking to each other directly via libdbus
 routing via a bus
    two threads talking to each other via a session-bus-like bus
I find that I need the 750k iterations to get consistent results; if I
only do a few tens of thousands, times vary wildly.
I've had the most luck profiling with Soeren's "sysprof" in GNOME CVS;
valgrind seems to be somewhat misleading if the goal is to optimize wall
clock time.
It looks to me like most of the remaining performance wins would be in
marshaling and unmarshaling/validation, possibly including an option to
turn validation off. But there could also be clever wins that aren't
obvious to me.
There would also be a performance win by just cutting out many of the
abstraction layers in libdbus, but I guess I consider most of those
layers to be worth it. e.g. DBusString is a big security/stability win.
An abstraction that could be a win to remove is DBusMessage; instead you
could do something like:

dbus_connection_write (connection, METHOD_CALL,
                       ARGS, TYPE_STRING, "foo",
                       ...);
this would try to write directly to the socket and never marshal to a
buffer (just marshal straight to the stream). You could do something
similar for reading. But it's a ton of work and no guarantee it turns
out to be a good idea or even faster at all. Perhaps if we did this
alongside rather than instead of DBusMessage it would be feasible.
To be honest I'm not sure more performance tuning is really worthwhile;
for any desktop/systemwide application we're currently thinking about,
it's unlikely that dbus would be even 1% of the overall profile, so a 5%
improvement to dbus is only 0.05% improvement to real-life applications.
One dbus performance case to consider may be large dbus messages, e.g.
for Imsep, but my guess is that there's no way to fundamentally speed
this up without something like the above dbus_connection_write() that
avoids the copy into DBusMessage. Copying out of DBusMessage right now
takes two copies (the copy the API already forces, and the copy into
your own data structure). It's trivial to remove one of those via const
accessors to DBusMessage; the other requires a dbus_connection_read()
sort of API.
Incidentally, there are some warnings/bugs from valgrind skins helgrind
and memcheck that could use investigation if anyone is looking for a
straightforward hacking task.