[systemd-devel] [Q] About supporting nested systemd daemon

Wed Feb 25 09:48:20 PST 2015

On Wed, 25.02.15 00:05, Cyrill Gorcunov (gorcunov at gmail.com) wrote:

> Hi all! I would really appreciate it if someone could tell me whether there
> is a simple solution for a problem we met in OpenVZ: modern containers are
> mostly systemd based, so once a container is started its systemd daemon
> mounts its own instance of the systemd cgroup (if it has not been
> pre-mounted by the container startup tools or whatever). To strictly isolate
> the nested systemd cgroup (by "nested" I mean a systemd cgroup instance
> mounted inside a container) we've patched the kernel so that the container's
> systemd obtains its own cgroup instance that does not intersect in any way
> with the one present on the host system.
> 
> And we would really love to get rid of this kind of kernel hack but still be
> able to isolate a nested systemd with its own cgroup instance using solely
> userspace tools. Is there some way to achieve this?

Not really. cgroupfs doesn't really allow that. First of all, the root
cgroup has a different set of attributes than child cgroups, hence you
cannot mount an arbitrary child cgroup as the root and assume it
works. But even worse, /proc/$PID/cgroup actually contains the full
cgroup path, and hence mounting only a subtree would break the
references from that file.
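
To illustrate the /proc/$PID/cgroup point, here is a minimal Python sketch.
It parses the "hierarchy-ID:controller-list:path" lines of that file and
shows how a lookup goes wrong when only a subtree is mounted. The sample
contents, the scope name "machine-ct1.scope" and the helper names are all
made up for illustration; nothing here is systemd or kernel API.

```python
from typing import List, NamedTuple, Optional


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of a /proc/<pid>/cgroup file."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Format: hierarchy-ID:controller-list:cgroup-path
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


def naive_lookup(mount_point: str, reported_path: str) -> str:
    """What a tool does when it trusts the path: mount point + full path."""
    return mount_point.rstrip("/") + reported_path


def relative_to_subtree(reported_path: str, subtree: str) -> Optional[str]:
    """Path as it would appear if only `subtree` were mounted, else None."""
    subtree = subtree.rstrip("/")
    if subtree == "":
        return reported_path
    if reported_path == subtree:
        return "/"
    if reported_path.startswith(subtree + "/"):
        return reported_path[len(subtree):]
    return None


# Hypothetical file contents for a service inside a container.
SAMPLE = """\
4:cpu,cpuacct:/machine.slice/machine-ct1.scope
1:name=systemd:/machine.slice/machine-ct1.scope/system.slice/foo.service
"""

if __name__ == "__main__":
    entry = parse_proc_cgroup(SAMPLE)[1]
    subtree = "/machine.slice/machine-ct1.scope"  # assumed container subtree
    mount_point = "/sys/fs/cgroup/systemd"
    # The file still carries the full host path ...
    print("naive: ", naive_lookup(mount_point, entry.path))
    # ... but with only the subtree mounted the directory lives elsewhere.
    print("actual:", mount_point + relative_to_subtree(entry.path, subtree))
```

The two printed paths differ, which is the breakage: anything that joins
the mount point with the path from /proc/$PID/cgroup ends up at a directory
that does not exist inside a subtree-only mount.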

systemd-nspawn nowadays mounts all hierarchies into the container, but
mounts all controller hierarchies read-only, and in the name=systemd
hierarchy mounts everything read-only except the subtree the
container is allowed to manage. That way only the cgroup tree the
container needs access to is writable to it. That solution however
does not hide the cgroup tree. A process running inside the container
can still go and explore the tree and its attributes. However, all
other groups will appear empty to it, since processes not in the
container's PID namespace will be suppressed when reading the member
process list.
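
One way to inspect such a layout from inside a container is to read
/proc/self/mountinfo and check which cgroup mounts are read-only. The
following Python sketch does that on sample text. The sample lines are
invented to resemble the layout described above and are not actual
systemd-nspawn output; the function name and the scope name are
hypothetical as well.

```python
from typing import List, Tuple


def cgroup_mounts(mountinfo_text: str) -> List[Tuple[str, str, bool]]:
    """Return (mount_point, root, read_only) for every cgroup mount.

    mountinfo fields: ID, parent ID, major:minor, root, mount point,
    mount options, optional fields..., "-", fs type, source, super options.
    """
    result = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # The "-" separator comes after the six fixed fields.
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) < sep + 2 or fields[sep + 1] != "cgroup":
            continue
        root, mount_point = fields[3], fields[4]
        options = fields[5].split(",")
        result.append((mount_point, root, "ro" in options))
    return result


# Invented sample: controller hierarchy read-only, name=systemd read-only
# except for the container's own subtree, plus an unrelated /proc mount.
SAMPLE = """\
100 90 0:25 / /sys/fs/cgroup/cpu ro,nosuid,nodev,noexec - cgroup cgroup rw,cpu
101 90 0:22 / /sys/fs/cgroup/systemd ro,nosuid,nodev,noexec - cgroup cgroup rw,name=systemd
102 101 0:22 /machine.slice/machine-ct1.scope /sys/fs/cgroup/systemd/machine.slice/machine-ct1.scope rw,nosuid,nodev,noexec - cgroup cgroup rw,name=systemd
103 90 0:40 / /proc rw,relatime - proc proc rw
"""

if __name__ == "__main__":
    for mount_point, root, read_only in cgroup_mounts(SAMPLE):
        mode = "ro" if read_only else "rw"
        print(f"{mode}  {mount_point}  (root {root})")
```

On a real system, pass the contents of /proc/self/mountinfo instead of
SAMPLE. In the sample only the container's own subtree shows up as "rw".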

There have been proposals on LKML to add cgroup namespacing, but I have
no idea where that went.

LXC created a FUSE emulation of /proc and /sys, called lxcfs, to solve
this problem. Quite honestly I find this a pretty crazy idea, however.

> If I understand correctly we can provide a separate slice to the container's
> systemd, leaving the rest of the host cgroup in ro mode, right?

Yes.

> If so, maybe there is a way to hide the host cgroup completely from the
> container so it would see only its own cgroup in sysfs?

I don't see how this could work. I mean, you could overmount all other
cgroup siblings with empty directories in the container, but that is not
really scalable, nor compatible with cgroups being added or removed
later on...
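
To make the scaling problem concrete, here is a rough Python sketch that
computes which directories would need an empty overmount for a given
container path. It only walks a directory tree and performs no mounts.
The function name and the slice and container names in the demo are
hypothetical.

```python
import os
import tempfile
from typing import List


def siblings_to_hide(cgroup_root: str, container_path: str) -> List[str]:
    """List every directory that is a sibling of an ancestor of the
    container's cgroup, i.e. everything an overmount would have to cover."""
    hidden = []
    current = cgroup_root
    for part in [p for p in container_path.split("/") if p]:
        for name in sorted(os.listdir(current)):
            full = os.path.join(current, name)
            # Skip the one directory that leads to the container itself.
            if name != part and os.path.isdir(full):
                hidden.append(full)
        current = os.path.join(current, part)
    return hidden


if __name__ == "__main__":
    # Build a fake cgroup tree in a temporary directory for the demo.
    with tempfile.TemporaryDirectory() as root:
        for d in ("system.slice", "user.slice",
                  "machine.slice/ct1", "machine.slice/ct2"):
            os.makedirs(os.path.join(root, d))

        before = siblings_to_hide(root, "/machine.slice/ct1")
        print(len(before), "overmounts needed")

        # A new container's cgroup appears later on the host ...
        os.makedirs(os.path.join(root, "machine.slice/ct3"))
        after = siblings_to_hide(root, "/machine.slice/ct1")
        # ... and the list computed earlier no longer covers it.
        print(len(after), "overmounts needed after a cgroup was added")
```

The number of overmounts grows with every sibling at every level, and a
list computed at container start goes stale as soon as the host creates
or removes a cgroup.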

Lennart

-- 
Lennart Poettering, Red Hat