[systemd-devel] issues with large number of units, systemd 204 and 208 [d10k]

Joe Miller joeym at joeym.net
Thu Oct 17 14:09:15 PDT 2013


--- Quick background:
I work with David Strauss @ Pantheon, and systemd is a core part of our
platform. Recently we have been running into scalability issues with
systemd, specifically the time required for `daemon-reload` to complete. We
are seeing situations where it takes a long time (~50 s) to complete or
times out entirely (90 s). We understand there have been fixes in systemd
208 that address issues with dependency calculation and should help speed
up daemon-reload. I have been testing systemd-208, and while I do see the
improvement in daemon-reload time, I am also observing new issues which I
believe may be related to cgroups.

--- Current setup:
We are currently on Fedora 19 with systemd 204. A typical server for us has
about 16,000 units, but we would like to get to around 32,000 units, which
would be equivalent to about 5,000 "containers" for our platform. These
units are split roughly evenly among .mount, .automount, .socket, and
.service units. Most units are inactive at any given time and are started
on demand via socket activation.
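Concretely, each of the socket-activated pairs looks roughly like this (a sketch; only the unit-name scheme comes from our setup, while the listen address and command line are hypothetical placeholders):

```ini
# test-nginx-1.socket -- socket-activation trigger for test-nginx-1.service
[Socket]
ListenStream=/run/test-nginx-1.sock

[Install]
WantedBy=sockets.target

# test-nginx-1.service -- started on demand when the socket sees traffic
[Service]
ExecStart=/usr/sbin/nginx -c /etc/nginx/test-1.conf
```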

--- New issues with systemd 208 and (maybe) cgroup management?
I am working on running a series of tests right now and hope to have more
numbers for the mailing list shortly. In the meantime, here is a synopsis
of what I am seeing:

First - I have created a test box running f19 + systemd-204 and 16,000
units, split between 4000 .mounts, 4000 .automounts, 4000 .sockets, and
4000 .services (test-X.mount, test-X.automount, test-nginx-X.socket,
test-nginx-X.service). `daemon-reload` takes about 15 seconds to complete
with all else relatively quiet on the server. Restart/start/stop of any
given service is fast, <1 s.
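For anyone who wants to reproduce this, a fixture of that shape can be generated with a small script along these lines (a sketch; the unit names follow the scheme above, but the unit contents are placeholders, not the ones from my real test):

```shell
# gen_units N DIR: write N sets of placeholder test units into DIR.
# Mount paths use /test/<i> so the escaped unit name is test-<i>.mount.
gen_units() {
  n=$1; dir=$2
  i=1
  while [ "$i" -le "$n" ]; do
    printf '[Mount]\nWhat=tmpfs\nWhere=/test/%s\nType=tmpfs\n' "$i" \
      > "$dir/test-$i.mount"
    printf '[Automount]\nWhere=/test/%s\n' "$i" \
      > "$dir/test-$i.automount"
    printf '[Socket]\nListenStream=/run/test-nginx-%s.sock\n' "$i" \
      > "$dir/test-nginx-$i.socket"
    printf '[Service]\nExecStart=/bin/true\n' \
      > "$dir/test-nginx-$i.service"
    i=$((i + 1))
  done
}
```

Something like `gen_units 4000 /etc/systemd/system` followed by `systemctl daemon-reload` approximates the load described above.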

Next - I upgrade to systemd-208, same 16,000 units. systemd completes the
re-exec but then sits at 100% CPU indefinitely (or at least for hours; I
gave up waiting). All attempts to talk to systemd time out. An strace of
systemd shows that it is calling open() on every object in the
/sys/fs/cgroup tree. Rebooting the server results in a box that cannot
complete a boot: the console shows it stuck very early in the boot process
at the "Welcome to f19" message. I suspect systemd is spinning at 100%
CPU there as well and cannot move forward.
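The pattern is easy to quantify from a saved strace log (hypothetical commands for anyone reproducing; the log can be captured with something like `strace -f -e trace=open -o systemd.strace -p 1`):

```shell
# count_cgroup_opens LOG: count open() calls in a saved strace log
# that touch the /sys/fs/cgroup tree (grep -c counts matching lines).
count_cgroup_opens() {
  grep -c 'open("/sys/fs/cgroup' "$1"
}
```

Under 204 this count is a handful per service restart; under 208 it appears to cover the whole tree.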

Finally - On a hunch, based on the data observed through strace, I modify
my sample units and create a new .slice for each set of units so that not
all 16,000 units are placed under the default system.slice. I thus end up
with one additional unit per set, i.e. 4000 * (test-X.slice, test-X.mount,
test-X.automount, test-nginx-X.socket, test-nginx-X.service), with each
unit in a set assigned to the relevant test-X.slice. This works with
systemd-204, which ignores the unknown Slice= settings. I then upgrade to
systemd-208, which goes more smoothly, and a reboot of the server
succeeds. `daemon-reload` is fast now: 3 s versus 15 s. However,
start/stop/restart of a service takes 25-30 seconds, and an strace of
systemd again shows a lot of open() activity across the cgroup tree.
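For reference, the per-set slice workaround looks roughly like this (a sketch; the unit names follow the scheme above, and the slice contents are a minimal assumption):

```ini
# --- test-1.slice (one new unit per set) ---
[Unit]
Description=Slice for test set 1

# --- fragment added to each process-carrying unit in the set,
# --- e.g. test-nginx-1.service ---
[Service]
Slice=test-1.slice
```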


I am wondering if something significant changed between 204 and 208 with
regard to cgroup handling. When restarting a service under 204, strace
shows only a handful of open() calls on the cgroup nodes relevant to the
service being restarted, but under 208 systemd appears to be scanning the
entire cgroup tree.