D-Bus optimizations

Fri Mar 2 14:52:12 PST 2012

Hi,

On Fri, Mar 2, 2012 at 12:10 PM, Bart Cerneels
<bart.cerneels at collabora.co.uk> wrote:
> daemon, but the whole process just took too long. Main reason: dbus-daemon
> was too busy and every DBus-using process was dragged down in priority
> because of the priority inversion caused by the context switching from peer
> to daemon to peer.

Too busy doing what? That would be my first impulse to investigate here:
what is so chatty that dbus-daemon is flooded and backlogged? Where is
the daemon spending its CPU time when it's backlogged?

If the goal is specifically that the daemon can't be flooded with
stuff because that causes context switches that confuse the scheduler,
then do you really need to undertake a difficult change to dbus
itself, or can you just figure out what's flooding it with stuff and
stop it doing that?  Or somehow improve the scheduling?

Maybe not, maybe it's inherently flooded just with normal desktop
stuff these days. I don't know. But pragmatically speaking that's what
I'd want to know. I mean, this change-dbus approach just sounds really
hard to me, _if_ done with docs and tests and back compat and all the
client libs supporting it and all that. I could be wrong.

Also worth remembering that the N900 has roughly iPhone
3GS-generation specs: slightly less powerful than a Raspberry Pi, for
example, and notably less powerful than current top-end smartphones. This
multicast feature would probably only be finished in a couple of
years, when we'll be on, for example, the iPhone 6 ... yeah, there will
always be someone running something slow, but... one notable feature
of smartphones at the moment is that multicore is rapidly becoming
standard issue.

It would just be great to have some kind of clear and measurable goal
in mind. If the goal is "as fast as possible" / "runs on the
lowest-end hardware we can come up with no matter what" / "as fast as
raw sockets", then I don't know; maybe it'd be better to build a
really simple, fast side-channel with a dbus-compatible type system and
encourage shipping it alongside all the dbus implementations.
It could piggyback on dbus for authentication, setup and so on, so the
side-channel spec could be pretty simple. Heck, in some sense
maybe multicast _is_ this side-channel, or could be slightly adjusted
to be that.

A side-channel could even be better, because you could actively seek to
drop semantic guarantees that regular dbus has, and avoid some of
these back-compat worries.

> That is not a position that can be maintained while at the same time wanting
> to be pragmatic. The point is that DBus is being used a lot on modern systems,
> many key middleware components depend on it, and many more will in the future
> because of its critical mass. I realize scalability was probably also not
> one of the goals, but it clearly has issues there.
> Besides, quoting from the Introduction chapter of the spec:
> "D-Bus is low-latency ... low-overhead ... easy to use"
> The idea is certainly right, just needs to be maintained in the future.

It just seems even more pragmatic to me to ask "what's causing all the
dbus traffic, and which bus is it on?" first. I'm not saying nobody
has asked that, maybe even on the list; I probably missed it.

If there are no tradeoffs, certainly nobody is against speed
improvements, and maybe multicast can be done without tradeoffs. I'd
just be wary of taking the original tradeoffs and slowly weakening the
semantics and complicating the implementation to the point that what's
left makes nobody happy. It's important to keep a design coherent. If
an IPC system could be universal, we'd have stuck with CORBA.

I don't have a good understanding of how multicast affects things like
ordering, spoofing, security policies, etc. I'm purely thinking "it sure
seems like it would affect some of that, and I'd like to see the
spec/docs patch before spending too much time implementing."

If we're talking about 2x speedup with no semantic changes or
tradeoffs then yeah, that sounds awesome.

fwiw I think the spec is pretty clear in the same section you're
talking about that dbus makes certain tradeoffs. The actual text about
"low-latency" has an (admittedly bullshit) interpretation of
"low-latency" as just "avoids round trips", which is the only sense in
which low-latency could ever have been true of dbus. It sadly has
never been low-latency in the sense of "uses a relatively low amount
of time to make a round trip."

Again, I'm not a current maintainer of anything remotely involved in
this discussion, so you don't have to convince me. The above is just
the first stuff I'd ask if I were personally tasked with fixing the
"dropped calls on N900" problem, based on the info in this thread. I
know you're all way ahead of me on a lot of this.

Havoc