method timeouts vs systemd activation and slow systems

Simon McVittie smcv at collabora.com
Mon Mar 6 12:39:42 UTC 2023


On Thu, 02 Mar 2023 at 10:15:23 -0500, Colin Walters wrote:
> Long ago I think Havoc chose to default to 25 second timeouts for DBus
> method calls.  This is generally OK on mostly idle single-user desktop
> systems.  (Particularly modern laptops/desktops with NVMe drives etc.)

I believe the reasoning for this was that it's a fairly arbitrary
balancing act: the timeout should be long enough that it won't be
reached on any system with reasonable interactive usability (so that if
it is reached, there is obviously a problem), but short enough that a
caller whose peer never replies eventually fails and can still make
forward progress.

> But servers...well, it turns out that some people want to e.g. use
> under-resourced (or rather over-provisioned) private OpenStack instances
> for example - and there, I/O can be *really* slow.  And basically we start
> seeing the 25 second DBus method call show up here as system failures.

A high-quality D-Bus client library should let you specify any method
call timeout of your choice, including "infinite". For example, libdbus
allows timeouts between 0 and 0x7ffffffe milliseconds, and special-cases
DBUS_TIMEOUT_INFINITE = 0x7fffffff as meaning "wait forever" instead of
its historical meaning of almost 25 days; the hard-coded default timeout
is only applied when the caller passes a negative timeout, and even then
it is merely a default.
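
As a minimal sketch of what that looks like with libdbus (the
com.example.Slow names are placeholders for a real service, and error
handling is kept short):

    #include <dbus/dbus.h>
    #include <stdio.h>

    int
    main (void)
    {
      DBusError error = DBUS_ERROR_INIT;
      DBusConnection *connection = dbus_bus_get (DBUS_BUS_SYSTEM, &error);
      DBusMessage *call, *reply;

      if (connection == NULL)
        {
          fprintf (stderr, "%s\n", error.message);
          return 1;
        }

      call = dbus_message_new_method_call ("com.example.Slow",
                                           "/com/example/Slow",
                                           "com.example.Slow",
                                           "DoSomething");

      /* Any negative timeout (DBUS_TIMEOUT_USE_DEFAULT) would select the
       * hard-coded default; DBUS_TIMEOUT_INFINITE means "wait forever". */
      reply = dbus_connection_send_with_reply_and_block (connection, call,
          DBUS_TIMEOUT_INFINITE, &error);
      dbus_message_unref (call);

      if (reply == NULL)
        {
          fprintf (stderr, "%s: %s\n", error.name, error.message);
          return 1;
        }

      dbus_message_unref (reply);
      return 0;
    }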

Of course, not all D-Bus client libraries are high-quality: notably,
the way to set this timeout in the obsolete dbus-glib library has
historically been rather haphazard (much like the rest of dbus-glib).

There are some timeouts in the "limits" data structure of the message
bus implementation (dbus-daemon or dbus-broker) which are somewhat
orthogonal to the client-side method call timeout. In dbus-daemon
configuration language:

* service_start_timeout (default 25s)
* auth_timeout (default 5s)
* pending_fd_timeout (default 150s)

In particular, a method call that activates a new service will fail after
the client-side method call timeout or the message bus's
service_start_timeout, whichever is shorter. I know Ubuntu used to patch
their system.conf so that service activation could still work on live CDs,
back when those were a thing.
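
For illustration only (the value is arbitrary, not a recommendation),
raising that limit looks something like this in system.conf, or in a
local configuration file that it includes:

    <busconfig>
      <!-- allow activated services up to 2 minutes to take their name -->
      <limit name="service_start_timeout">120000</limit>
    </busconfig>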

Some of these limits are security-sensitive, in that they are used as
an anti-denial-of-service mechanism to prevent a malicious client from
tying up finite message bus resources indefinitely (we probably cannot
ever fully prevent denial of service via resource exhaustion, but these
limits mitigate it).

There is also a reply_timeout limit which is equivalent to, but
independent of, the client-side method call timeout: a method call will
fail after the client-side method call timeout or the reply_timeout,
whichever is shorter. However, since 2009 the default reply_timeout has
been infinite, so in practice only the client-side method call timeout
is active.

> - Change dbus-{daemon,broker} to have a config for this

To be clear, are you proposing that the message bus implementations
should have a configuration option that doesn't actually do anything
within the message bus itself, but is passed on to their clients as a
suggestion "this would be a good default for you to use"?

> - Fix client libraries to query the daemon for the timeout (or parse
> the configs on their own? not sure)

I think they would have to ask the message bus implementation, unless
we're willing to make every high-quality D-Bus client library parse
XML. I don't want to make libdbus depend on libexpat, and I suspect
that team systemd would be even less enthusiastic about that prospect
for libsystemd.

The two obvious ways to query this would be to include it in the SASL
handshake like NEGOTIATE_UNIX_FD, or include it in the message bus's
o.fd.DBus.Properties alongside the Features and Interfaces properties
defined in the spec
<https://dbus.freedesktop.org/doc/dbus-specification.html#standard-interfaces-properties>.

As others have said already, there would be a bit of a chicken-and-egg
problem with using a D-Bus method call to retrieve this limit, but it
would be easy for client libraries to special-case the initial method
calls to have a long or infinite timeout. Pseudocode:

    C: <SASL stuff>
    S: <SASL stuff>
    C: BEGIN
    C: o.fd.DBus.Hello (with long or infinite timeout)
    S: reply to o.fd.DBus.Hello with the unique name
    C: o.fd.DBus.GetAll(o.fd.DBus) (with long or infinite timeout)
    S: {Features: <[...]>, Interfaces: <[...]>, SuggestedTimeout: <42000>}
    C: (continue with its real work, with default timeout = 42000ms)
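
In libdbus terms, the GetAll step might look something like the sketch
below; SuggestedTimeout is the hypothetical new property proposed here,
not something the message bus implements today:

    #include <dbus/dbus.h>

    /* Fetch all of the message bus's properties, using an infinite
     * timeout so that this bootstrap step cannot itself be defeated by
     * not yet knowing the suggested timeout. */
    static DBusMessage *
    get_bus_properties (DBusConnection *connection, DBusError *error)
    {
      const char *iface = "org.freedesktop.DBus";
      DBusMessage *call, *reply;

      call = dbus_message_new_method_call ("org.freedesktop.DBus",
                                           "/org/freedesktop/DBus",
                                           "org.freedesktop.DBus.Properties",
                                           "GetAll");
      dbus_message_append_args (call,
                                DBUS_TYPE_STRING, &iface,
                                DBUS_TYPE_INVALID);
      reply = dbus_connection_send_with_reply_and_block (connection, call,
          DBUS_TIMEOUT_INFINITE, error);
      dbus_message_unref (call);

      /* The reply body is an a{sv}: iterate it with DBusMessageIter and,
       * if a SuggestedTimeout entry is present, use its value as the
       * default timeout for later calls; otherwise keep the built-in
       * default. */
      return reply;
    }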

The client could easily pipeline the Hello and GetAll messages, if desired:
that's going to be a lot more reliable than pipelining the SASL handshake,
because "normal D-Bus" is more expressive than the SASL handshake, and has
a better framing protocol.

There are some practical problems with including the suggested timeout
in the SASL handshake as suggested elsewhere in the thread:

* Until the SASL handshake has finished, the client connection is taking
  up one of a finite number of "slots" in the message bus, but we do not
  "officially" know whose connection it is. In practice we could often
  shortcut the authentication by assuming that it will "probably" use
  EXTERNAL and reading the SO_PEERCRED from an AF_UNIX client connection
  (sketched after this list), but that's a layering violation (the
  purpose of the SASL handshake
  is to establish the client's identity), and in principle there is
  nothing to stop a client from authenticating on behalf of a different,
  cooperating uid by using DBUS_COOKIE_SHA1 or similar. As a result,
  the whole authentication/authorization handshake has a relatively short
  timeout, defaulting to 5 seconds, to avoid unauthenticated clients being
  able to carry out a denial of service attack via resource exhaustion.

* The SASL handshake is brittle and its implementations are generally
  some of the worst-understood parts of any particular library. In
  particular, until recently, GDBus' state machine for some mechanisms
  was wrong for several code paths, which can easily result in deadlocks.

* Because client implementors (notably in libsystemd) want to be able to
  pipeline transactions, but the D-Bus Specification was never really
  designed for that, server implementors have to implement the SASL
  handshake in a very simplistic way, using unbuffered reads. The more
  out-of-scope stuff we squeeze into the SASL handshake, the worse their
  performance will be.
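
For reference, the SO_PEERCRED shortcut mentioned in the first bullet
point is just a getsockopt() call on the AF_UNIX socket, something like
this Linux-specific sketch:

    #define _GNU_SOURCE             /* struct ucred, on glibc */
    #include <sys/socket.h>
    #include <sys/types.h>

    /* Read the uid of the peer process on an AF_UNIX socket, without
     * waiting for SASL EXTERNAL authentication to complete. */
    static int
    peer_uid (int fd, uid_t *uid)
    {
      struct ucred cred;
      socklen_t len = sizeof cred;

      if (getsockopt (fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;

      *uid = cred.uid;
      return 0;
    }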

    smcv

