[systemd-devel] [Q] About supporting nested systemd daemon

Sun May 3 18:51:24 PDT 2015

On Thu, 30.04.15 15:42, Alban Crequy (alban at endocode.com) wrote:

> > systemd-nspawn nowadays mounts all hierarchies into the container, but
> > mounts all controller hierarchies read-only, and of the name=systemd
> > hierarchy mounts everything read-only, except the subtree the
> > container is allowed to manage. That way only the cgroup tree the
> > container needs access to is writable to it. That solution however
> > does not hide the cgroup tree. A process running inside the container
> > can still go and explore the tree and its attributes. However, all
> > other groups will appear empty to it, since processes not in the
> > container PID namespaces will be suppressed when reading the member
> > process list.
> 
> To sum up what systemd-nspawn is currently mounting in the container:
> - /sys/fs/cgroup/systemd/  -->  mounted RO
> - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  --> mounted RW
> - /sys/fs/cgroup/cpu,cpuacct/  -->  mounted RO
> - etc. for other cgroup hierarchies  -->  mounted RO

Correct.
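
For anyone who wants to check this layout from inside a container, here
is a minimal Python sketch (not part of nspawn) that parses
/proc/self/mountinfo and reports each cgroup mount as read-only or
read-write. The function name cgroup_mounts is made up for illustration;
the field positions follow the mountinfo format documented in proc(5).

```python
import os


def cgroup_mounts(mountinfo_text):
    """Return (mount_point, "ro" | "rw") for every cgroup mount listed.

    mountinfo_text is the content of /proc/<pid>/mountinfo. Per-mount
    options are the 6th field; the filesystem type is the first field
    after the " - " separator.
    """
    result = []
    for line in mountinfo_text.splitlines():
        if " - " not in line:
            continue
        left, right = line.split(" - ", 1)
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) < 6 or not right_fields:
            continue
        if right_fields[0] not in ("cgroup", "cgroup2"):
            continue
        mount_point = left_fields[4]
        options = left_fields[5].split(",")
        # Only the per-mount options matter here; the superblock options
        # on the right-hand side are usually "rw" even for RO bind mounts.
        mode = "ro" if "ro" in options else "rw"
        result.append((mount_point, mode))
    return result


if __name__ == "__main__" and os.path.exists("/proc/self/mountinfo"):
    with open("/proc/self/mountinfo") as f:
        for mount_point, mode in cgroup_mounts(f.read()):
            print(mode, mount_point)
```

Run inside the container, this should list the name=systemd hierarchy
and the controller hierarchies as "ro", with only the container's own
scope subtree showing up as "rw".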

> In order to let systemd in the container restrict cpu, memory, etc. on
> some of its services (see manpage systemd.resource-control(5)), rkt
> would like systemd-nspawn to mount a subtree of some hierarchy
> (cpu,cpuacct, memory) in read-write mode.

That's really not a safe thing to do right now... the kernel isn't
ready for this, as cgroups access is an all-or-nothing thing
currently: if you have access to a cgroup and can create child cgroups
in it you have access to *all* attributes you like, the dangerous ones
as well as the not so dangerous ones.
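
To see what "all-or-nothing" means in practice, the following Python
sketch lists which attribute files in a given cgroup directory the
calling process may write. It is only an illustration: the helper name
writable_attributes is hypothetical, and the check relies on
os.access(), which reports what the mount and file permissions allow,
not whether writing a given attribute is a good idea.

```python
import os


def writable_attributes(cgroup_dir):
    """Return the names of attribute files in cgroup_dir that this
    process is allowed to write, in sorted order.

    Child cgroups (subdirectories) are skipped; only regular files,
    i.e. the cgroup's attributes, are reported.
    """
    names = []
    for entry in sorted(os.listdir(cgroup_dir)):
        path = os.path.join(cgroup_dir, entry)
        # os.access() fails for W_OK on a read-only mount, so on an RO
        # hierarchy this list is expected to come back empty.
        if os.path.isfile(path) and os.access(path, os.W_OK):
            names.append(entry)
    return names
```

Pointed at a subtree that is mounted read-write, this would typically
return every attribute of that controller at once, which is the problem
described above: there is no way to hand out only the harmless ones.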

> Are there any issues with changing the systemd-nspawn mounts in the
> following way:
> - /sys/fs/cgroup/systemd/  -->  mounted RO
> - /sys/fs/cgroup/systemd/machine.slice/machine-xxx.scope/  --> mounted RW
> - /sys/fs/cgroup/cpu,cpuacct/  -->  mounted RO
> - /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-xxx.scope/  --> mounted RW
> - etc. for other cgroup hierarchies.
> 
> Iago wrote two experimental patches on systemd-nspawn to try that and
> it worked. Delegate=yes was enabled in systemd-nspawn in order to test
> this:
> https://github.com/endocode/systemd/commits/iaguis/delegate
> 
> But I would like to know what is missing to make this safe (or if it
> is already safe to do).

Well, nspawn does not actually make any guarantees about security
currently. Since we pass CAP_SYS_ADMIN by default to the containers,
people can mount whatever they want and remount things freely from
within. Hence, opening this up would not make things much worse.

That said, I am a bit concerned about opening this up by default. Even
though containers are insecure we should try to be safe wherever we
can if it doesn't affect usability too much. 

Adding a new cmdline switch for all of this does not sound too
attractive, but maybe a --delegate switch would be OK, which would open
up all controllers to the containers... It would then have a similar
effect on the containers as Delegate=yes has for service processes...

Lennart

-- 
Lennart Poettering, Red Hat