max_outgoing_bytes: What if a D-Bus peer is too slow to read?

Havoc Pennington hp at pobox.com
Fri Oct 8 08:16:50 PDT 2010


Hi,

On Fri, Oct 8, 2010 at 10:14 AM, Colin Walters <walters at verbum.org> wrote:
> Well, one thing we have discovered is that the current semantics of
> the bus daemon are pretty unfortunate, specifically the random
> arbitrary limits (127MiB and 1GB respectively for the system and
> session bus).

It would be within the spec, if the spec existed, to set these limits
to infinite, if that's what you're suggesting.

Another approach: I think you could make the sender do the buffering
without changing any semantics. Right now the daemon reads a message,
decides it can't buffer it, and generates an error. Instead, the daemon
could read a message header, decide it can't buffer a message of that
size, and then just stop reading from the socket until it can buffer
it. Now the sender buffers messages and no message ever fails to send.
The exception would be if a single message is itself too large and can
never succeed; then the daemon could generate an error.

Those are probably both better than sending the error, since the error
isn't easily recoverable.
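
To make the second idea concrete, here's a rough sketch of what that
daemon-side flow control could look like. This is not dbus-daemon's
actual code: the limits, the peer struct and the header parsing are
simplified stand-ins, and a real version would also have to re-arm
reading once the receiver's queue drains:

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/types.h>
  #include <sys/socket.h>

  #define MAX_OUTGOING_BYTES (1024 * 1024) /* hypothetical per-receiver limit */
  #define MAX_MESSAGE_BYTES  (128 * 1024)  /* single-message hard limit */

  struct peer {
      int    fd;              /* the peer's socket */
      size_t outgoing_bytes;  /* bytes already queued to write to this peer */
      bool   reading_paused;  /* true = fd temporarily out of the read poll set */
  };

  /* Peek at the fixed 16-byte header without consuming it and return the
   * size the whole message will need.  Simplified: the real header also
   * has a variable-length field array after these 16 bytes. */
  static ssize_t peek_message_size(int fd)
  {
      uint8_t hdr[16];
      uint32_t body_len;

      if (recv(fd, hdr, sizeof hdr, MSG_PEEK) < (ssize_t) sizeof hdr)
          return -1;                     /* header not complete yet */
      memcpy(&body_len, hdr + 4, 4);     /* body length lives at offset 4 */
      return (ssize_t) sizeof hdr + body_len;
  }

  enum action { READ_MESSAGE, PAUSE_READING, REJECT_MESSAGE, WAIT_FOR_MORE };

  /* Called when 'sender' becomes readable and its next message is
   * destined for 'receiver'. */
  static enum action decide(struct peer *sender, struct peer *receiver)
  {
      ssize_t total = peek_message_size(sender->fd);

      if (total < 0)
          return WAIT_FOR_MORE;
      if ((size_t) total > MAX_MESSAGE_BYTES)
          return REJECT_MESSAGE;         /* can never fit: send an error */
      if (receiver->outgoing_bytes + (size_t) total > MAX_OUTGOING_BYTES) {
          sender->reading_paused = true; /* back-pressure: stop reading and
                                            resume when the queue drains */
          return PAUSE_READING;
      }
      return READ_MESSAGE;
  }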

Anyway, this is sort of a separate problem from kernel dbus, I think.
If this is an issue there must be simple fixes.

The current code amounts to "if something that never happens does
happen, it's probably something malicious or at least really buggy, so
basically give up and punt." That could be improved.
If you set the limits low enough that they are actually reached, then
you end up _having_ to improve this in your implementation; you can't
just punt it as dbus-daemon does.

> Actually, I have gotten complaints from people using dbus on even
> server-class hardware.  The primary complaint was about latency when
> the OS was virtualized (I'm guessing where kvm was configured with 1
> virtual CPU).  I didn't debug it, but my guess is that that kind of
> situation is going to magnify greatly the scheduling latency,
> especially if there's contention for the physical cores.

I don't think this is a specific enough problem report to say what the
solution is, though it'd be great to dig in. What were they doing -
when was latency important? Fixable by fixing an app? Why was the
latency so high, was it due to context switching or what? What was the
latency exactly and is there some way we could reproduce and measure
this situation?

It sort of sounds like their virtualization setup just sucked;
adding thousands of lines of tricky code is maybe not the solution to
that.

> It probably doesn't help that I think RHEL5 shipped with a version of
> dbus-glib which watched all NameOwnerChanged signals, causing
> dbus-daemon to write to a ton of different processes.

Indeed that would be bad. This is the kind of app (via library) bug
that just needs fixing.

> This is part of the point of the discussion, we're debating whether
> the dbus-daemon semantics are useful and/or right.  If it makes things
> easier for the kernel implementation, and nothing relies on them, we
> could look at changing them.

I agree with that, sure.

One tempting thing would be to completely punt to the app by providing
a convenience API for setting up a direct dbus connection that would
be totally unrelated to the daemon. Basically have a method like:

 GetDirectDBus(out fd)

The peer would create a new socketpair() and wrap its end in a
DBusConnection (or GDBus equivalent), then pass the other end out to
the GetDirectDBus caller. We'd have a convenience API:

 DBusConnection* dbus_bus_get_direct(DBusConnection* bus, const char *bus_name);

which would call GetDirectDBus, get the fd back, and return it wrapped
in a new connection.

The idea would be to make this easy enough that if you're going to
have a long conversation with an app or send lots of data, you could
just fire up a new socket. Combined with splice() to avoid copies, you
could really shovel large data over it and it'd be sane.
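
To sketch the service side of GetDirectDBus: assuming fd passing
(DBUS_TYPE_UNIX_FD, which needs a new enough libdbus), the handler
could look roughly like this. The handler name is made up, and the
last step, wrapping our end of the socketpair in a DBusConnection, has
no public libdbus call today; that's precisely the convenience API
proposed above:

  #include <dbus/dbus.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static DBusHandlerResult handle_get_direct_dbus(DBusConnection *bus,
                                                  DBusMessage    *call)
  {
      int fds[2];

      if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
          return DBUS_HANDLER_RESULT_NEED_MEMORY;  /* crude error path */

      int local_fd  = fds[0];   /* stays with this service */
      int remote_fd = fds[1];   /* handed back to the caller */

      DBusMessage *reply = dbus_message_new_method_return(call);
      dbus_message_append_args(reply,
                               DBUS_TYPE_UNIX_FD, &remote_fd,
                               DBUS_TYPE_INVALID);
      dbus_connection_send(bus, reply, NULL);
      dbus_message_unref(reply);

      /* libdbus duplicates the fd into the message, so we still own
       * and must close our copy of the remote end. */
      close(remote_fd);

      /* Here we'd wrap local_fd in a new DBusConnection and start
       * serving on it; that wrapper is exactly the missing
       * convenience API being proposed. */
      (void) local_fd;

      return DBUS_HANDLER_RESULT_HANDLED;
  }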

Then the elaboration would be what I was talking about before: some
way to automatically interleave stuff from the direct link into the
main connection's message stream. It could be as simple as having a
new header field for a client serial shared across the two
connections; before returning a message to the app, the main
connection would require that it has the next expected serial. So if
you get 1, 2, 3, 5 on the main connection and 4 on the direct
connection, the main connection returns 1, 2, 3, blocks until 4 shows
up on the direct connection, then returns 5. At that point the API
becomes just:

 dbus_bus_go_direct(bus, bus_name);

and then the library moves traffic to and from bus_name over to a
direct link transparently.
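
Just to pin down the ordering rule, here's a toy sketch of the receive
side. The shared client serial header field is hypothetical and these
helpers are not any existing dbus API; a real version would also need
to block or poll the other connection while a serial is missing:

  #include <stdint.h>
  #include <stddef.h>

  #define PENDING_MAX 64          /* max messages held out of order */

  struct message {
      uint32_t client_serial;     /* the hypothetical shared header field */
      /* ... payload ... */
  };

  struct reorder {
      uint32_t        next_serial;           /* next serial to hand to the app */
      struct message *pending[PENDING_MAX];  /* early arrivals, keyed by serial */
  };

  /* Feed in a message read from *either* the main or the direct
   * connection; returns it if it is the next one in order, or NULL if
   * we must keep waiting (the message is stashed for later). */
  static struct message *reorder_push(struct reorder *r, struct message *m)
  {
      if (m->client_serial != r->next_serial) {
          r->pending[m->client_serial % PENDING_MAX] = m;
          return NULL;
      }
      r->next_serial++;
      return m;
  }

  /* After delivering one message, drain any stashed messages that have
   * now become deliverable, one per call. */
  static struct message *reorder_pop(struct reorder *r)
  {
      struct message *m = r->pending[r->next_serial % PENDING_MAX];
      if (m != NULL) {
          r->pending[r->next_serial % PENDING_MAX] = NULL;
          r->next_serial++;
      }
      return m;
  }

In the 1, 2, 3, 5-then-4 example, reorder_push hands back 1, 2 and 3
as they arrive, stashes 5, and hands back 4 when it shows up on the
direct link; reorder_pop then drains the stashed 5.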

I guess this is a lot of work but it still sounds easier than kernel dbus ;-)

Havoc

