[systemd-devel] [Q] About supporting nested systemd daemon

Alban Crequy alban at endocode.com
Thu Apr 30 06:42:25 PDT 2015


On Wed, Feb 25, 2015 at 6:48 PM, Lennart Poettering
<lennart at poettering.net> wrote:
> On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcunov at gmail.com) wrote:
>
>> Hi all! I would really appreciate it if someone could enlighten me on whether there
>> is a simple solution for the problem we met in OpenVZ: modern containers are mostly
>> systemd based, so once a container starts up, the systemd daemon mounts its own
>> instance of the systemd cgroup (if it has not been pre-mounted by the container
>> startup tools or whatever). To strictly isolate the nested systemd cgroup (by
>> "nested" I mean a systemd cgroup instance mounted inside a container) we've patched
>> the kernel so that the container's systemd obtains its own cgroup instance that does
>> not intersect in any way with the one present on the host system.
>>
>> And we would really love to get rid of this kind of kernel hack but still be able
>> to isolate a nested systemd with its own cgroup instance using solely userspace
>> tools. Is there some way to achieve this?
>
> Not really. cgroupfs doesn't really allow that. First of all the root
> cgroup has a different set of attributes than child cgroups, hence you
> cannot mount an arbitrary child to the root cgroup and assume it
> works. But even worse, /proc/$PID/cgroup actually contains the full
> cgroup path, and hence mounting only a subtree would break the
> references from that file.
>
> systemd-nspawn nowadays mounts all hierarchies into the container, but
> mounts all controller hierarchies read-only, and of the name=systemd
> hierarchy mounts everything read-only, except the subtree the
> container is allowed to manage. That way only the cgroup tree the
> container needs access to is writable to it. That solution however
> does not hide the cgroup tree. A process running inside the container
> can still go and explore the tree and its attributes. However, all
> other groups will appear empty to it, since processes not in the
> container PID namespaces will be suppressed when reading the member
> process list.
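
To illustrate the point about /proc/$PID/cgroup: each line of that file
has the form "hierarchy-ID:controller-list:cgroup-path", and the path is
relative to the root of the hierarchy, not to whatever subtree happens to
be mounted. Below is a minimal sketch of a parser. The sample content and
the function name are made up for illustration; they are not taken from a
real container.

```python
def parse_proc_cgroup(text):
    """Map each controller list to its cgroup path.

    `text` is the content of a /proc/<pid>/cgroup file.
    """
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        # Split at most twice so that a ':' inside the path survives.
        _hierarchy_id, controllers, path = line.split(":", 2)
        entries[controllers] = path
    return entries


# Hypothetical sample, shaped like what a process in the container's
# scope would report.
SAMPLE_PROC_CGROUP = (
    "4:cpu,cpuacct:/machine.slice/machine-xxx.scope\n"
    "1:name=systemd:/machine.slice/machine-xxx.scope\n"
)

parsed = parse_proc_cgroup(SAMPLE_PROC_CGROUP)
for controllers, path in sorted(parsed.items()):
    print(controllers, "->", path)
```

The paths start at the hierarchy root, so if only the machine-xxx.scope
subtree were mounted at the top of /sys/fs/cgroup/systemd, these paths
would no longer resolve under the mount point.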

To sum up what systemd-nspawn is currently mounting in the container:
- /sys/fs/cgroup/systemd/  -->  mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  --> mounted RW
- /sys/fs/cgroup/cpu,cpuacct/  -->  mounted RO
- etc. for other cgroup hierarchies  -->  mounted RO
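
One way to check this from inside a container is to read
/proc/self/mountinfo and look at the per-mount options of every cgroup
mount. The following is only a sketch: the function name and the sample
lines are hypothetical (written by hand to match the list above), and it
only looks at the per-mount ro/rw flag.

```python
def cgroup_mount_modes(mountinfo_text):
    """Return {mount point: "ro" or "rw"} for all cgroup mounts.

    `mountinfo_text` is the content of a /proc/<pid>/mountinfo file.
    """
    modes = {}
    for line in mountinfo_text.splitlines():
        # A " - " separator ends the variable-length optional fields;
        # the filesystem type is the first field after it.
        if " - " not in line:
            continue
        left, right = line.split(" - ", 1)
        left_fields = left.split()
        if right.split()[0] != "cgroup":
            continue
        mount_point = left_fields[4]           # fifth field: mount point
        options = left_fields[5].split(",")    # sixth field: per-mount options
        modes[mount_point] = "ro" if "ro" in options else "rw"
    return modes


# Hand-written sample lines, not captured from a real system.
SAMPLE_MOUNTINFO = "\n".join([
    "100 90 0:22 / /sys/fs/cgroup/systemd ro,nosuid,nodev,noexec"
    " - cgroup cgroup rw,name=systemd",
    "101 100 0:22 /machine.slice/machine-xxx.scope"
    " /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope"
    " rw,nosuid,nodev,noexec - cgroup cgroup rw,name=systemd",
    "102 90 0:25 / /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec"
    " - cgroup cgroup rw,cpu,cpuacct",
    "103 90 0:30 / /proc rw,nosuid - proc proc rw",
])

modes = cgroup_mount_modes(SAMPLE_MOUNTINFO)
for mount_point in sorted(modes):
    print(modes[mount_point], mount_point)
```

On a real system the same function could be fed the content of
/proc/self/mountinfo instead of the sample string.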

In order to let systemd in the container restrict cpu, memory, etc. on
some of its services (see manpage systemd.resource-control(5)), rkt
would like systemd-nspawn to mount a subtree of some hierarchy
(cpu,cpuacct, memory) in read-write mode.

Are there any issues with changing the systemd-nspawn mounts in the
following way:
- /sys/fs/cgroup/systemd/  -->  mounted RO
- /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  --> mounted RW
- /sys/fs/cgroup/cpu,cpuacct/  -->  mounted RO
- /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/  --> mounted RW
- etc. for other cgroup hierarchies.
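
The proposed layout can be written down as a small function that, for
each hierarchy, emits the read-only top-level mount plus a read-write
mount of the container's scope for the hierarchies that should be
delegated. This is only a sketch of the layout described above, not the
code from the patches: the function and parameter names are hypothetical,
and it only computes the plan without mounting anything.

```python
import posixpath

CGROUP_ROOT = "/sys/fs/cgroup"


def proposed_mounts(hierarchies, scope, delegated):
    """Return a list of (path, mode) pairs describing the mount plan.

    hierarchies: names of the hierarchy directories under /sys/fs/cgroup
    scope:       the container's cgroup path relative to a hierarchy root
    delegated:   hierarchies whose scope subtree should be writable
    """
    plan = []
    for hierarchy in hierarchies:
        top = posixpath.join(CGROUP_ROOT, hierarchy)
        # The hierarchy as a whole stays read-only.
        plan.append((top, "ro"))
        if hierarchy in delegated:
            # Only the container's own subtree becomes writable.
            plan.append((posixpath.join(top, scope.lstrip("/")), "rw"))
    return plan


plan = proposed_mounts(
    hierarchies=["systemd", "cpu,cpuacct", "memory"],
    scope="machine.slice/machine-xxx.scope",
    delegated={"systemd", "cpu,cpuacct"},
)
for path, mode in plan:
    print(mode, path)
```

With `delegated` containing only "systemd", the same function yields the
current layout, so the difference between the two lists is limited to
the extra read-write entries.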

Iago wrote two experimental patches on systemd-nspawn to try that and
it worked. Delegate=yes was enabled in systemd-nspawn in order to test
this:
https://github.com/endocode/systemd/commits/iaguis/delegate

But I would like to know what is missing to make this safe (or if it
is already safe to do).

> There have been proposals on LKML to add cgroup namespacing, but no
> idea where that went.
>
> LXC created a FUSE emulation of /proc and /sys, called lxcfs to solve
> this problem. Quite honestly I find this a pretty crazy idea however.
>
>> If I understand correctly, we can provide a separate slice to the container's
>> systemd, leaving the rest of the host cgroup in ro mode, right?
>
> Yes.
>
>> If so, is there maybe a way to hide the host cgroup completely from the
>> container so it would see only its own cgroup in sysfs?
>
> I don't see how this could work. I mean, you could overmount all other
> cgroup siblings with empty directories in the containers, but not
> really scalable nor compatible with cgroups being added or removed
> later on...
>
> Lennart
>
> --
> Lennart Poettering, Red Hat
> _______________________________________________
> systemd-devel mailing list
> systemd-devel at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/systemd-devel

