[systemd-devel] Unexpected D-Bus timeout (https://github.com/systemd/systemd/issues/32381)

Michal Sekletar msekleta at redhat.com
Wed May 15 14:38:30 UTC 2024


On Wed, May 15, 2024 at 2:59 PM Michal Sekletar <msekleta at redhat.com> wrote:

> Hi everyone,
>
> I am currently struggling to understand why I am running into D-Bus
> related timeout on PID 1 API bus, which in turn causes systemd to
> transition the bus to BUS_CLOSING state and stop all Type=dbus services.
> Here is a reproducer that was put together by Frantisek.
>
> #/bin/bash
> set -eux
> set -o pipefail
>
> NMOUNTS=800
> MPOINT="$(findmnt --mountpoint / --noheading -o SOURCE)"
> TS="$(date +"%s.%N")"
>
> systemctl mask --now --runtime systemd-oomd.{socket,service}
> id testuser || useradd testuser
> systemctl stop "tmp-mnt*.mount" || :
> systemctl log-level debug
>
> for i in {0..100}; do
>         echo "=== $i ==="
>         for i in $(seq 0 $NMOUNTS); do
>                 mkdir -p "/tmp/mnt$i"
>                 systemd-mount "$MPOINT" "/tmp/mnt$i" &
>         done
>         wait
>         for i in $(seq 0 $NMOUNTS); do
>                 systemd-umount "/tmp/mnt$i" &
>         done
>         wait
>
>         busctl status --user --machine=testuser@ --no-pager
>         systemctl stop user at 4711.service
>         (! journalctl --since="@$TS" --grep "Connection terminated")
> done
>
> The timeout is detected in process_timeout() in sd-bus, however, the
> reply_callback which timed out has a message with matching reply_cookie
> already in the bus read queue (for a long time). How is it possible it
> wasn't processed in the previous event loop iteration? Clearly in
> reproducer we force systemd to do a bunch of work, i.e. even loop is
> getting dispatched all the time? Btw, in order to verify that reply is
> already sitting in the queue I've compiled systemd with the following patch.
>

Actually we are not looping through the event loop as quickly as I thought.
Each systemd-mount establishes a private D-Bus connection so we end up
having a couple 100s of them and then we are generating D-Bus signals (e.g.
PropertiesChanged) and sending them to all these private buses. Maybe we
can stop doing it because AFAIK systemd-mount doesn't care about that
signal anyway. Can we save a ton of needless work (and maybe fix this issue
as a side effect) and send out only Disconnected and JobRemoved signals
(needed by bus_wait_for_jobs()) to private buses?



>
> diff --git a/src/libsystemd/sd-bus/bus-message.h
> b/src/libsystemd/sd-bus/bus-message.h
> index 76f0d853d3..16d25e11a7 100644
> --- a/src/libsystemd/sd-bus/bus-message.h
> +++ b/src/libsystemd/sd-bus/bus-message.h
> @@ -119,6 +119,8 @@ struct sd_bus_message {
>          unsigned n_header_offsets;
>
>          uint64_t read_counter;
> +
> +        time_t rq_in;
>  };
>
>  static inline bool BUS_MESSAGE_NEED_BSWAP(sd_bus_message *m) {
> diff --git a/src/libsystemd/sd-bus/bus-socket.c
> b/src/libsystemd/sd-bus/bus-socket.c
> index 07179e0705..6c1bc3e13b 100644
> --- a/src/libsystemd/sd-bus/bus-socket.c
> +++ b/src/libsystemd/sd-bus/bus-socket.c
> @@ -1337,6 +1337,7 @@ static int bus_socket_make_message(sd_bus *bus,
> size_t size) {
>          if (t) {
>                  t->read_counter = ++bus->read_counter;
>                  bus->rqueue[bus->rqueue_size++] =
> bus_message_ref_queued(t, bus);
> +                t->rq_in = time(NULL);
>                  sd_bus_message_unref(t);
>          }
>
> diff --git a/src/libsystemd/sd-bus/sd-bus.c
> b/src/libsystemd/sd-bus/sd-bus.c
> index 22784e8f66..9cc7ce212f 100644
> --- a/src/libsystemd/sd-bus/sd-bus.c
> +++ b/src/libsystemd/sd-bus/sd-bus.c
> @@ -511,6 +511,7 @@ static int synthesize_connected_signal(sd_bus *bus) {
>          memmove(bus->rqueue + 1, bus->rqueue, sizeof(sd_bus_message*) *
> bus->rqueue_size);
>          bus->rqueue[0] = bus_message_ref_queued(m, bus);
>          bus->rqueue_size++;
> +        m->rq_in = time(NULL);
>
>          return 0;
>  }
> @@ -2681,6 +2682,7 @@ static int process_timeout(sd_bus *bus) {
>          struct reply_callback *c;
>          sd_bus_slot *slot;
>          bool is_hello;
> +        unsigned i;
>          usec_t n;
>          int r;
>
> @@ -2695,6 +2697,15 @@ static int process_timeout(sd_bus *bus) {
>          if (c->timeout_usec > n)
>                  return 0;
>
> +        for (i = 0; i < bus->rqueue_size; i++) {
> +                sd_bus_message *q = bus->rqueue[i];
> +
> +                if (c->cookie == q->reply_cookie) {
> +                        log_error("Message was put to read queue %"
> PRIiMAX " seconds ago.", time(NULL) - q->rq_in);
> +                        assert_not_reached();
> +                }
> +        }
> +
>          r = bus_message_new_synthetic_error(
>                          bus,
>                          c->cookie,
> @@ -4464,5 +4475,6 @@ _public_ int sd_bus_enqueue_for_read(sd_bus *bus,
> sd_bus_message *m) {
>                  return r;
>
>          bus->rqueue[bus->rqueue_size++] = bus_message_ref_queued(m, bus);
> +        m->rq_in = time(NULL);
>          return 0;
>  }
>
> Once compiled with above patch and running reproducer for some time I am
> hitting assert_not_reached() and log says that matching reply was in read
> queue for more than 82 seconds! Any ideas or debugging tips would be highly
> appreciated.
>
> Cheers,
> Michal
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/systemd-devel/attachments/20240515/fbf7dc09/attachment.htm>


More information about the systemd-devel mailing list