[systemd-devel] [HEADSUP] cgroup changes

Fri Jun 21 10:36:03 PDT 2013

Heya,

On Monday I posted this mail:

http://lists.freedesktop.org/archives/systemd-devel/2013-June/011388.html

Here's an update and a bit on the bigger picture:

Half of what I mentioned there is now in place. There's a new "slice"
unit type in git, and everything is hooked up to it. logind will now
also keep track of running containers/VMs. The various container/VM
managers have to register with logind now. This serves the purpose of
better integration of containers/VMs everywhere (so that "ps" can show
for each process where it belongs). However, the main reason for this
is that this is eventually going to be the only way containers/VMs can
get a cgroup of their own.

So, in that context, a bit of the bigger picture:

It took us a while to realize the full extent of how awfully unusable
cgroups currently are. The attributes have way more interdependencies
than people might think and it is trivial to create nonsensical
configurations...

Of course, understanding how awful the status quo is makes for a good
first step. But we really needed to figure out what we can do about this to
clean this up in the long run, and how we can get to something useful
quickly. So, after much discussion between Tejun (the kernel cgroup
maintainer) and various other folks here's the new scheme that we want
to go for:

1) In the long run there's only going to be a single kernel cgroup
hierarchy; the per-controller hierarchies will go away. The single
hierarchy will allow controllers to be individually enabled for each
cgroup. The net effect is that the hierarchies the controllers see are
no longer orthogonal; they are always subtrees of the full single
hierarchy.

2) This hierarchy becomes private property of systemd. systemd will set
it up. Systemd will maintain it. Systemd will rearrange it. Other
software that wants to make use of cgroups can do so only through
systemd's APIs. This single-writer logic is absolutely necessary, since
interdependencies between the various controllers, the various
attributes, the various cgroups are non-obvious and we simply cannot
allow cgroup users to alter the tree independently of each other
forever. Because of all this, the "Pax Cgroup" document is a thing of
the past; it is dead.

3) systemd will almost entirely hide the fact that cgroups are used
internally. In fact, we will take away the unit configuration options
ControlGroup=, ControlGroupModify=, ControlGroupPersistent=,
ControlGroupAttribute= in their entirety. The high-level options
CPUShares=, MemoryLimit= and so on will continue to exist and we'll
add additional ones like them. The system.conf setting
DefaultControllers=cpu will go away too. Basically, you'll get more
high-level settings, but all the low level bits will go away without
replacement. We will take away the ability for the admin to set
arbitrary low-level attributes, to arrange things in completely
arbitrary cgroup trees or to enable arbitrary controllers for a service.

4) systemd git introduced a new unit type called "slice" (see
above). This is for partitioning up resources of the system into
slices. Slices are hierarchical, and other units (such as services, but
also containers/VMs and logged in users) can then be assigned to these
slices. Slices internally map to cgroups, but they are a very high-level
construct. Slices will expose the same CPUShares=, MemoryLimit=
properties as the other units do. This means resource management will
become a first-class, built-in functionality of systemd. You can create
slices for your customers, and in them subslices for their departments,
and then run services, users, VMs in them. In the long run these will
even be dynamically movable (while they are running), but that'll take
more kernel work. By default there will be three slices: "system.slice"
(where all system services are located by default), "user.slice" (where
all logged in users are located by default), "machine.slice" (where all
running VMs/containers are located by default). However, the admin will
have full freedom to create arbitrary slices and then move the other
units into them.

5) systemd's logind daemon already kept track of logged in
users/sessions. It is now extended to also keep track of virtual
machines/containers. In fact, this is how libvirt/nspawn and friends
will now get their own cgroups. They register as a machine, which means
passing a bit of meta info to systemd, and getting a cgroup assigned in
response. This registration ensures that "ps" and friends can show to
which VM/container a process belongs, and also easily allows other tools
to query container/VM info, so that in the long run we'll be able to
provide a level of container/VM integration comparable to what Solaris
zones offer.
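
To make the slice hierarchy from point 4 a bit more concrete, here is a
minimal Python sketch of how nested slice names could expand into cgroup
paths. The three default slice names are the ones listed above. The
dash-separated nesting rule is an assumption taken from the
systemd.slice(5) documentation rather than from this mail, and the
customer and department names are made up for illustration.

```python
# Illustrative sketch, not systemd code. The three default slice names come
# from this mail; the dash-based nesting rule is an assumption taken from
# systemd.slice(5), and the customer/department names are made up.

DEFAULT_SLICES = {
    "service": "system.slice",   # system services
    "user": "user.slice",        # logged in users
    "machine": "machine.slice",  # running VMs/containers
}


def slice_to_cgroup_path(slice_name: str) -> str:
    """Expand a slice unit name into the nested cgroup path it would map to."""
    suffix = ".slice"
    if not slice_name.endswith(suffix):
        raise ValueError(f"not a slice unit name: {slice_name!r}")
    parts = slice_name[: -len(suffix)].split("-")
    if any(not part for part in parts):
        raise ValueError(f"empty component in slice name: {slice_name!r}")
    # Every dash-separated prefix of the name is itself a parent slice.
    ancestors = ["-".join(parts[: i + 1]) + suffix for i in range(len(parts))]
    return "/" + "/".join(ancestors)


if __name__ == "__main__":
    for name in ("system.slice", "customer1.slice", "customer1-sales.slice"):
        print(f"{name:24} -> {slice_to_cgroup_path(name)}")
```

Under that assumption a department subslice such as
"customer1-sales.slice" ends up below "customer1.slice", which is the
customer/department partitioning described in point 4.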

So, all of this together sounds like an awful lot of change. #1 and #2
are long-term changes. However #3, #4, #5 are something we can do now
and should do now, as preparation for the single-writer, unified cgroup
tree. We really, really shouldn't ship the cgroup mess any longer, or
people will make use of the current systemd APIs that expose way too
many internal guts, stuff that we *know* right now is broken and will
cease to exist. We don't want to expose low-level details we already
know *now* we cannot support for long.

Even though #3, #4, #5 sound like major work, they are not. In fact #4
and #5 are already fully implemented upstream on the systemd side. I
am working on #3. I am confident that I'll have this finished in a few
days too, since this is really just about deleting code more than
writing code.

With #3, #4, #5 we have something in place that should do the basic
things and first and foremost will hide all the lower-level details of
cgroups. This has the big benefit of allowing us to rearrange these
details later without having to break the user or
programming interfaces, and that's what I really care about here.

Now, what does this mean for other projects using cgroups? Basically,
since we won't implement #1 + #2 immediately, the cgroup tree stays
relatively open for other cgroup users. They can continue to fiddle with
it for now, but it must be clear that this is temporary, and that they
shouldn't attempt anything too fancy. Direct access to the cgroup tree
is on its way out and that must be clear to everybody.

More specifically: libcgroup is out of the game with
this. libvirt/openshift/lxc/... can continue to do what they do for now,
however they should be updated sooner rather than later to do things the
systemd way, i.e. rely on systemd VM/container registration and user
cgroup management.
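
For container/VM managers, the registration flow from point 5 can be
pictured with the toy Python model below. It is only an in-memory
illustration of "pass a bit of meta info, get a cgroup assigned in
response". It is not the logind API: the class and method names, the
metadata fields and the cgroup path layout are all hypothetical.

```python
# Toy in-memory model of machine registration. This is NOT the logind API:
# class names, fields and the cgroup path layout are hypothetical.
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Machine:
    name: str        # machine name chosen by the container/VM manager
    kind: str        # "vm" or "container"
    leader_pid: int  # main process of the machine
    cgroup: str      # cgroup handed back on registration


class MachineRegistry:
    def __init__(self, slice_name: str = "machine.slice") -> None:
        self._slice = slice_name
        self._machines: Dict[str, Machine] = {}

    def register(self, name: str, kind: str, leader_pid: int) -> Machine:
        """Record the metadata and assign a cgroup below the machine slice."""
        if kind not in ("vm", "container"):
            raise ValueError(f"unknown machine kind: {kind!r}")
        if name in self._machines:
            raise KeyError(f"machine already registered: {name!r}")
        machine = Machine(name, kind, leader_pid, f"/{self._slice}/{name}")
        self._machines[name] = machine
        return machine

    def machine_of_cgroup(self, cgroup_path: str) -> Optional[Machine]:
        """Map a process's cgroup path back to its machine, as "ps" might."""
        for machine in self._machines.values():
            if cgroup_path == machine.cgroup or cgroup_path.startswith(
                machine.cgroup + "/"
            ):
                return machine
        return None


if __name__ == "__main__":
    registry = MachineRegistry()
    web = registry.register("web1", "container", leader_pid=4242)
    print(web.cgroup)
    print(registry.machine_of_cgroup(web.cgroup + "/payload"))
```

The point of the model is the single-writer shape: the manager never
creates the cgroup itself, it only asks for one and gets the result
handed back.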

And to make one last thing clear: this time, it's not Kay and me who are
taking away the cgroup tree from everybody else, it's actually all
Tejun's fault as the kernel cgroup maintainer... ;-) He wants a unified,
single-writer hierarchy, and it took us a while to agree to that, but
we're now fully on the same page with him.

If you are using non-trivial cgroup setups with systemd right now, then
things will change for you. We will provide you with similar
functionality as before, but things will be different and less
low-level. As long as you have only used the high-level options such as
CPUShares=, MemoryLimit= and so on, you should be on the safe side.
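
If you want a rough idea of whether your own units are affected, a small
script along these lines could help. It is a minimal Python sketch that
scans unit file text for the options named in point 3 as going away. The
option names are taken from this mail. The example unit, its values and
the daemon path are made up, and the parser is deliberately simplistic
(no line continuations, no drop-in handling).

```python
# Minimal sketch: flag the low-level options that point 3 says will be
# removed. Option names are from this mail; the example unit is made up.
from typing import List, Set, Tuple

REMOVED_UNIT_OPTIONS = {
    "ControlGroup",
    "ControlGroupModify",
    "ControlGroupPersistent",
    "ControlGroupAttribute",
}
REMOVED_SYSTEM_CONF_OPTIONS = {"DefaultControllers"}


def find_removed_options(
    text: str, removed: Set[str] = REMOVED_UNIT_OPTIONS
) -> List[Tuple[int, str]]:
    """Return (line number, option name) for every removed option in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines, comments and [Section] headers.
        if not stripped or stripped.startswith(("#", ";", "[")):
            continue
        key, sep, _value = stripped.partition("=")
        if sep and key.strip() in removed:
            hits.append((lineno, key.strip()))
    return hits


# Hypothetical unit using both high-level and low-level settings.
EXAMPLE_UNIT = """\
[Service]
ExecStart=/usr/bin/example-daemon
CPUShares=512
MemoryLimit=1G
ControlGroup=cpu:/example
ControlGroupAttribute=memory.swappiness 70
"""

if __name__ == "__main__":
    for lineno, option in find_removed_options(EXAMPLE_UNIT):
        print(f"line {lineno}: {option}= is going away")
```

CPUShares= and MemoryLimit= are not flagged, since those are the
high-level options that stay. Pass REMOVED_SYSTEM_CONF_OPTIONS as the
second argument to check a system.conf for DefaultControllers= instead.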

I hope this makes some sense,

Lennart

-- 
Lennart Poettering - Red Hat, Inc.