[systemd-devel] Delegate= on slice before v237

Fri Feb 15 18:17:26 UTC 2019

On Mi, 13.02.19 11:52, Filipe Brandenburger (filbranden at gmail.com) wrote:

> > The fact it wasn't refused outright was an
> > accident, and because it was one I am not entirely sure what the
> > precise effect of allowing it was. However, I am pretty sure it at
> > least had two effects:
> >
> > 1. it would turn on all controllers for the cgroup
> >
>
> I don't *think* this is why libcontainer was trying to enable it, since a
> few lines down it's explicitly enabling all the controllers by
> setting MemoryAccounting, CPUAccounting and BlockIOAccounting during
> transient unit creation:
> https://github.com/opencontainers/runc/blob/v1.0.0-rc6/libcontainer/cgroups/systemd/apply_systemd.go#L275

Note that on current systemd+kernel CPUAccounting=yes has no effect on cgroup
controller enablement since the kernel now provides the accounting
data always, independently of the CPU controller.

> > 2. it would stop systemd to ever migrating foreign processes below
> >    that slice, which is primarily relevant only when changing cgroup
> >    related props on the slice dynamically I guess.
>
> I'm not sure I follow... Do you mean if libcontainer would write
> to memory.limit_in_bytes (or one of the other properties of the memory or
> other controller managed by systemd, such as cpu), then systemd would not
> end up overwriting this as it does some other operation on the
> cgroup?

No, not that. (Note btw, that the cgroup resource attrs on the
delegated cgroup itself remain exclusively owned by systemd. Only the
cgroup attrs of cgroups created below the delegated are owned by the
service payload. i.e. delegation is a bit weird in that regard:
"cgroup.groups" is delegated to the payload while "cpu.weight" or
"pids.max" is not.)

> I'm not completely sure I understand what "migrate foreign processes"
> means, given slices don't really hold any pids directly... Do you mean to
> scope and service units below that slice?

If systemd makes changes to the cgroup tree then it will consider the
"name=systemd" hierarchy authoritative, and the process memberships
are the ones systemd primarily manages, and that are the "primary
sources of truth". The other controller hierarchies are secondary to
that, and are only managed in a way that they are made "reduced
copies" of the "name=systemd" hierarchy. Meaning: when systemd wants
to know in which cgroup PID X is it checks the name=systemd hierarchy,
and when it wants to migrate X elsewhere it does so first in the
name=systemd, and then syncs the other hierarchies to the
"name=systemd" to the depth needed. Now, "to the depth needed" is more
complex than it might sound: due to "controllers are enabled in
parent" logic cgroup controller memberships propagate "side-ways" in
the cgroup tree. i.e. if the service foobar.service in slice
waldo.slice gets IOAccounting=yes enabled, then this means "blkio"
will be enabled also for any sibling quux.service or miau.service if
those are also in the same slice waldo.slice. If that happens during
runtime (and it often does, since starting quux.service first, and
then also starting foobar.service means "blkio" needs to be turned
on for the already running quux.service), then this means the blkio cgroup
for quux.service gets realized and then all processes in the
name=systemd cgroup of quux.service propagated into it, i.e. the blkio
hierarchy gets synced to the name=systemd hierarchy one level further
down, if you follow what I am blabbering here.

Now, if services have Delegate=yes turned on for themselves, then
systemd will never do such synchronization of cgroup trees ever: the
cgroup controller memberships are set in stone, and systemd will never
migrate anything below the delegated hierachy, since the payload of
course has freedom to distribute the pids among its heirarchies any
way it wants, and systemd should never interfere with it and rearrange
things moving PIDs to the top of the delegate cgroup subtree just
because a sibling was modified.

> In any case, for now I'll probably leave that alone... Though as I revamp
> libcontainer support for unified hierarchy, I'll try to skip that check on
> that case, that might make this a legacy-only setting, so not that
> important to fully get rid of it for a while...

Doing Delegate=yes on a slice is really broken. In systemd we lacked a
safety check to disallow this and the fact that runc used that is
exploiting a bug, nothing else.

Lennart

[6~--
Lennart Poettering, Red Hat