D-Bus optimizations

Havoc Pennington hp at pobox.com
Wed Mar 14 13:38:13 PDT 2012


Hi,

On Wed, Mar 14, 2012 at 5:03 AM, Bart Cerneels
<bart.cerneels at collabora.co.uk> wrote:
> Context switches is the biggest single problem to be solved since it causes
> priority inversions. The anecdotal case was just to point out the underlying
> architectural problem causes very real bugs. Don't think it's the only
> project stumbling on this, just the best understood.

Again, I'm not saying there isn't a problem, just that it has to be
well-defined in order to evaluate solutions - fortunately this paper
is very good on that point!

> Here is some supporting material:
> http://shorestreet.com/sites/shorestreet.com/files/dbus-perf-report-0.9.9.pdf
> Notice the heavy emphasis on priority and scheduling.

Did I miss this before? I apologize if so. This is very useful.

Some thoughts -

1)

In general, reading this, I think in some cases there are problems
that make sense to fix in dbus, and in other cases there are problems
that are best solved by not using dbus.

In particular I think the proposal to make it configurable whether to
use the daemon, whether to use the serialization format, whether to
have ordering, whether to have security, etc. is not a good direction.
There is a configurable solution already. It's called sockets. And
there are about 10000 IPC solutions already, from ICE (both of them)
to ZeroMQ to AMQP to CORBA to X11 to HTTP to SOAP to WebSockets to
SUN-RPC to whatever-the-heck. To me, trying to make dbus configurable
so that it can substitute for any of these is a Bad Idea (tm). Just
use an appropriate solution. If necessary, dbus can be used to
"bootstrap" the appropriate solution. But jamming a crappy version of
every IPC tradeoff into dbus seems like a wild goose chase (not to
mention a lot of work). dbus is the tradeoff that it is, and the other
things are the tradeoffs that they are; don't try to be what you
aren't.

What _does_ make sense to me is to piggyback on dbus for discovery and
authentication, so the other solutions could just get an fd passed
over dbus and not have to deal with "setup" of the connection,
perhaps.

The "fast sidechannel" patch is logical to me here. Take that 176K
telepathy message mentioned in section 3.4, move it to the
sidechannel, for example.
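
Something like this, roughly - the destination, object path and
interface here are made up, it's just to show the shape of handing
over an fd:

    #include <dbus/dbus.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int offer_sidechannel(DBusConnection *conn, const char *peer)
    {
        int sv[2];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
            return -1;

        DBusMessage *msg = dbus_message_new_method_call(
            peer,
            "/org/example/Transfer",                 /* hypothetical path */
            "org.example.Transfer", "OpenChannel");  /* hypothetical iface/method */

        /* Needs a unix-socket transport and a dbus new enough to support
         * DBUS_TYPE_UNIX_FD; the message keeps its own dup of the fd. */
        dbus_message_append_args(msg, DBUS_TYPE_UNIX_FD, &sv[1],
                                 DBUS_TYPE_INVALID);
        dbus_connection_send(conn, msg, NULL);
        dbus_message_unref(msg);
        close(sv[1]);

        return sv[0];   /* do the bulk traffic on this end directly */
    }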

You can also coordinate things like shared mem over dbus. Put your
176K in shared mem somewhere. Tell clients about it via dbus, and use
dbus to coordinate when the clients are finished with it, so you can
deallocate.

2)

That said, it's clear that when you _do_ want to do what dbus does,
there are some improvements available.

3)

Earlier in the thread I suggested that dbus has a potentially
misguided (de)optimization where it's reading and writing each message
one-by-one. The idea of the existing code was to use the DBusMessage
as the buffer and avoid a memcpy. I speculated this causes extra
context switches. If I'm understanding section 3.6.2 bullet point 3
correctly, this paper is saying that my speculation was correct.

Have you tried fixing this, i.e. read() and write() of large buffers?
It's probably simplest to experiment by just introducing another copy
(an intermediate buffer that DBusMessage is copied to/from), but a
more difficult-to-implement version might try to use readv/writev to
avoid that extra copy.
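
Roughly the shape I have in mind; the outgoing-queue type and its
fields are invented for illustration, not libdbus API:

    #include <sys/uio.h>
    #include <stddef.h>
    #include <errno.h>

    struct out_msg {
        const void     *data;   /* serialized message bytes */
        size_t          len;
        struct out_msg *next;
    };

    static ssize_t flush_queue(int fd, const struct out_msg *head)
    {
        struct iovec iov[64];
        int n = 0;

        for (const struct out_msg *m = head; m != NULL && n < 64; m = m->next) {
            iov[n].iov_base = (void *) m->data;
            iov[n].iov_len  = m->len;
            n++;
        }

        /* One syscall for the whole batch, so the peer can be woken once
         * instead of once per message.  A real version has to handle
         * short writes by advancing through the queue and retrying. */
        ssize_t written;
        do {
            written = writev(fd, iov, n);
        } while (written < 0 && errno == EINTR);

        return written;
    }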

This seems like a very approachable patch. For cases where protocols
are well-designed, it could hugely reduce context switches - I don't
know, but if every read/write switches right now, then batching 10
messages is a 10x reduction in switches, right? Now if you're
ping-ponging this won't help, but it's a bad idea to be round-tripping
anyway.

This seems like low-hanging fruit to go straight for, to me - unless
someone has already done it; I haven't looked in git.

But if context switches are the problem, and the above analysis is
correct, then this is a *small* and *easy to get upstream* kind of
patch that could help a whole lot... has anyone pursued it? why not
start here and see if it helps?

4)

The last time I profiled dbus (a long time ago) I reached the same
conclusion about "distributed bloat" and no clear hotspots, though
there are a couple of wins like disabling validation.

One thought I used to have on how to speed it up was to use threads so
that the socket-io/serialize/deserialize code could be blocking. I
think we discussed implementing libgbus in this way, though I don't
know if it ended up being faster due to that or slower due to other
stuff or what. This would simplify various parts of the code, and
allow using the stack instead of malloc much more often.
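
The shape would be roughly this - deliver_data() is a stand-in I made
up, not a real libdbus entry point:

    #include <pthread.h>
    #include <unistd.h>

    /* hypothetical hand-off to the dispatch side */
    extern void deliver_data(const char *buf, size_t len);

    static void *reader_thread(void *arg)
    {
        int fd = *(int *) arg;
        char buf[64 * 1024];            /* stack buffer, no malloc per read */

        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n <= 0)
                break;                  /* EOF or error; real code would report */
            /* Real code reassembles message boundaries here; the point
             * is just that blocking reads keep the logic linear and let
             * a big batch of messages arrive per wakeup. */
            deliver_data(buf, (size_t) n);
        }
        return NULL;
    }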

But this is a major and difficult change (though no more so than the
other major reworks discussed in the paper). Fortunately the
"political" considerations from years ago that kept libdbus from using
threads originally are probably no longer a concern.

You could also play around with memory allocation and try to improve
cache locality. There's already some attempt to re-use memory blocks
in dbus-message.c (or used to be), but it's maybe a little bit lame.
Ideas could include:

 - rounding off DBusMessage sizes to try to get exactly the same size
more often, or to match page sizes (see the sketch below)
 - making sure glibc is using mmap on large blocks (didn't alexl have
some patch to nautilus or something related to this? fuzzy memory)
 - kooky idea: in the daemon, try to alloca() message buffers as long
as they can be synchronously dispatched and are small-ish... you could
speculatively alloca then copy to malloc if you end up popping the
stack ;-) kooky.
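
The size-rounding one could be as dumb as this (pure sketch, not what
dbus-message.c does today):

    #include <stdlib.h>
    #include <unistd.h>

    /* Bucket message buffer sizes so a recycling allocator sees a few
     * common sizes instead of a long tail of unique ones. */
    static size_t round_message_size(size_t len)
    {
        size_t page = (size_t) sysconf(_SC_PAGESIZE);

        if (len <= 512)
            return 512;                 /* covers the ~300 byte typical case */
        if (len <= 4096)
            return 4096;
        /* Large messages: round to whole pages so glibc can satisfy them
         * with mmap and hand the memory back on free. */
        return (len + page - 1) / page * page;
    }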

There was that bug Simon linked to, to write the sender on the client
side so the daemon doesn't realloc() every single message - that's a
clever and _easy_ idea that probably would help. Why not try that?

5)

Historical note about the last time I profiled: at one point I rewrote
the marshal/demarshal to support recursive types. Originally you could
have an array of string but not an array of array of string. I did a
lot of optimization *before* the recursive rewrite, and the recursive
rewrite made things quite a bit slower.

What this may imply is that the marshal/demarshal code is a problem.
Maybe even if it doesn't show as a hotspot, it's somehow wrecking
caching or something else. In any case, that was a known point in time
where things went from fast-ish to slow-ish. Perhaps I changed some
unrelated thing at the same time and that's what broke it, I don't
know.

6)

I wonder if the equivalent of GSlice would help.
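
To illustrate what I mean (dbus can't actually depend on GLib, so this
is purely to show the idea of slice-allocating small fixed-size
structs; the struct is a stand-in, not DBusMessage):

    #include <glib.h>

    typedef struct {
        guint32 serial;
        guint32 body_len;
        gchar  *body;
    } ToyMessage;

    static ToyMessage *toy_message_new(void)
    {
        /* slice allocator instead of plain malloc; tends to improve
         * locality on alloc-heavy paths */
        return g_slice_new0(ToyMessage);
    }

    static void toy_message_free(ToyMessage *m)
    {
        g_slice_free(ToyMessage, m);
    }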

7)

I kind of question whether shared memory would be faster than unix
sockets once you were done with all the coordination it requires. From
the paper, it doesn't really seem like unix sockets are the bottleneck,
and the catch with shared memory is that you replace copies with
concurrency issues, which can be slower and buggier than the copies.

Look at X; I believe there was a failed attempt to thread the X
server. Where shared mem does make sense for X is big pixmaps. I bet
dbus is the same thing; it makes sense for really big messages.

Wouldn't there be a relatively easy approach where _some_ DBusMessage
(the big ones) use a shared memory block as their buffer, dedicate
that particular block to that message, and have it passed around among
the processes? I guess the hard part is figuring out who frees the
block, but maybe it's managed by the daemon...

It seems like there must be a way to transparently avoid copies only
for "big" messages, and basically encapsulate the optimization in
DBusMessage much as DBusMessage can currently hold a file descriptor.
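
A sketch of what the "payload lives in a shared block" path could look
like - the shm naming and cleanup policy here are made up, and that's
exactly the part the daemon would have to own:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>

    static int make_shared_payload(const void *data, size_t len)
    {
        /* A real implementation needs a collision-safe name and a clear
         * owner for the lifetime; the daemon could play that role. */
        int fd = shm_open("/dbus-big-msg", O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd < 0)
            return -1;
        shm_unlink("/dbus-big-msg");    /* fd keeps it alive; name goes away */

        if (ftruncate(fd, (off_t) len) < 0) {
            close(fd);
            return -1;
        }

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        memcpy(p, data, len);
        munmap(p, len);

        /* send this fd with DBUS_TYPE_UNIX_FD; the receiver mmaps it
         * instead of copying 176K through the daemon */
        return fd;
    }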

8)

Again, dbus is basically a rip-off of how X works. I seem to remember
lots of discussion of priority inversion around the X server. What was
the solution there? Or why is it not a problem there? The same answers
may apply here. Someone should get the X gurus to weigh in. I'm pretty
sure they've struggled with this for years. I also thought there was
some miraculous scheduler patch or some change that went into the
Linux distributions at some point that supposedly fixed it... maybe I
have it wrong.

9)

Priority inversion has _got_ to be possible to address by turning some
kind of scheduler knob. Solving a scheduling problem by completely
rewriting a system just seems wrong.
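
Just to show what I mean by a knob - this is mechanism, not a policy
recommendation, and it needs CAP_SYS_NICE/root or an rtkit-style
helper:

    #include <sched.h>
    #include <stdio.h>

    /* Give the bus a modest real-time priority so a busy low-priority
     * client can't starve it and, indirectly, the high-priority clients
     * waiting on replies.  Whether this is the right policy is exactly
     * the open question. */
    static int boost_bus_priority(void)
    {
        struct sched_param sp = { .sched_priority = 1 };

        if (sched_setscheduler(0, SCHED_RR, &sp) < 0) {
            perror("sched_setscheduler");
            return -1;
        }
        return 0;
    }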

10)

Section 3.4, typical usage, could have _far_ more data. How much of
the usage is round trips, and how much is batch-able? How much of the
usage (in all of # of messages, # of round trips, # of bytes of data)
comes from each of the mentioned applications? For signals sent, how
many clients receive each signal? What is the maximum "bog-down" (how
long do queues get in both time and # of messages for example)?
This section is the one that covers actual user-visible scenarios with
real apps, but it doesn't give answers on things like what kind of
throughput and latency are needed, what's achieved, and when the
problem is application design vs. dbus performance.

Basically I think this section needs to answer a couple things:

 - to what extent are the apps involved "at fault" in that they are
doing something that's just not a good idea
 - for the app protocols that are reasonable, what is the "current
number" and "goal number"  to get a good experience ("number" = some
measurable description of the scenario)

The trap to fall into is that "faster is always better" - it isn't,
because there are semantic concerns and back compat concerns and
amount of work concerns. There has to be a goal other than "as fast as
raw sockets" - if you want as fast as raw sockets, use raw sockets.

A simple variation on section 3.4 would be to run the numbers with
each individual app removed, and then you could characterize the
number and type of messages to be blamed on each app.

Instrumentation to detect blocking round trips seems critical. Such an
important number.
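
A cheap way to get that number: wrap the one blocking call apps use
and log how long each round trip stalls the caller. The wrapper name
is mine; the libdbus call it wraps is real:

    #include <dbus/dbus.h>
    #include <stdio.h>
    #include <time.h>

    static DBusMessage *call_and_time(DBusConnection *conn, DBusMessage *msg,
                                      int timeout_ms, DBusError *err)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        DBusMessage *reply =
            dbus_connection_send_with_reply_and_block(conn, msg, timeout_ms, err);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        fprintf(stderr, "blocking round trip: %s.%s took %.2f ms\n",
                dbus_message_get_interface(msg),
                dbus_message_get_member(msg), ms);
        return reply;
    }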

11)

The ideal might be a reproducible synthetic benchmark that reflects
the typical usage. The ping-pong benchmark is known-stupid usage, so
it isn't the right thing to optimize. I think a better benchmark might
include lots of 300-bytes-ish messages (based on section 3.4 average
size) with only a few round trips mixed in, and some
one-sender-one-receiver messages and some
one-sender-signalling-to-typical-number-of-receivers messages. Or
those benchmarks could be separate (e.g. have one that measures
sending tons of small messages without ever blocking, one that
measures sending small signals to N receivers without ever blocking,
and then maybe variants that do larger messages and variants that have
more receivers). But when evaluating a patch, you'd run all the
benchmarks and variants to get an overview of that patch's effect in
each scenario.

A good agreed-on benchmark that represents "correct" usage of dbus
would really help evaluate a series of performance patches.
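
A skeleton of the kind of thing I mean - destination and interface
names are placeholders, and a real harness would time and count; this
just shows the message mix:

    #include <dbus/dbus.h>
    #include <string.h>

    static void run_burst(DBusConnection *conn, int count)
    {
        char payload[300];                      /* ~section 3.4 average size */
        memset(payload, 'x', sizeof payload - 1);
        payload[sizeof payload - 1] = '\0';
        const char *p = payload;

        for (int i = 0; i < count; i++) {
            DBusMessage *msg = dbus_message_new_method_call(
                "org.example.Bench", "/org/example/Bench",
                "org.example.Bench", "Consume");
            dbus_message_set_no_reply(msg, TRUE);   /* never block on a reply */
            dbus_message_append_args(msg, DBUS_TYPE_STRING, &p,
                                     DBUS_TYPE_INVALID);
            dbus_connection_send(conn, msg, NULL);
            dbus_message_unref(msg);

            if (i % 100 == 99) {
                /* the rare round trip, so latency shows up in the numbers too */
                DBusMessage *ping = dbus_message_new_method_call(
                    "org.example.Bench", "/org/example/Bench",
                    "org.freedesktop.DBus.Peer", "Ping");
                DBusMessage *reply = dbus_connection_send_with_reply_and_block(
                    conn, ping, -1, NULL);
                if (reply)
                    dbus_message_unref(reply);
                dbus_message_unref(ping);
            }
        }
        dbus_connection_flush(conn);
    }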

12)

The bottom line I'm trying to get at is that there's a ton of
low-hanging fruit available that isn't going to complexify things,
break semantics, make all the reimplementations scurry around fixing
their stuff, involve arguing with kernel developers, etc. Of course
that low-hanging fruit may not be good enough, but why not try it?

Havoc

