[systemd-devel] Use of namespaced cgroups (aka Docker in systemd-nspawn containers)

Sat Jul 2 01:25:09 UTC 2016

On Mon, 27.06.16 16:58, Lee Hambley (lee.hambley at gmail.com) wrote:

> Hi List,
> 
> My company is currently conducting research into the most viable container
> technology that fits our stack (CentOS based) and given our already
> widespread reliance on systemd, I have a personal stake in preferring not
> to introduce other tooling (LXD, the 2nd place leader) into our stack.
> 
> I'd like to know what is required to fulfil our use-case (Docker in
> LXD/systemd-nspawn)
> 
> Here's what I (think I) know:
> 
>    - Docker can't run in systemd-nspawn because cgroup fs is mounted ro,
>    and the systemd-nspawn container sees the entire system's cgroupfs (no
>    namespacing)

There's a patch waiting on GitHub to add cgroup namespace support to
nspawn:

https://github.com/systemd/systemd/pull/3589

I am not a Docker guy, but do note that nspawn payloads have write
access to the name=systemd hierarchy below their subtree, and can
delegate that further, hence Docker could work, if it wanted to, as
long as it turns on delegation in its service or asks for a scope with
delegation turned on.
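
A minimal sketch of how a payload could find its own subtree in the
name=systemd hierarchy and check whether it may write there. It assumes
the cgroupv1 layout of /proc/self/cgroup (one "ID:controllers:path" line
per hierarchy) and the conventional mount point /sys/fs/cgroup/systemd;
both function names are hypothetical, and this is not how Docker itself
does it:

```python
import os


def systemd_cgroup_path(proc_cgroup_text):
    """Return our cgroup path in the name=systemd hierarchy, or None."""
    for line in proc_cgroup_text.splitlines():
        # Each line reads "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[1] == "name=systemd":
            return parts[2]
    return None


def subtree_is_writable(cgroup_path, mount_point="/sys/fs/cgroup/systemd"):
    """Check write access to the cgroup directory below mount_point.

    mount_point is an assumption; adjust it if the hierarchy is
    mounted elsewhere.
    """
    full_path = os.path.join(mount_point, cgroup_path.lstrip("/"))
    return os.access(full_path, os.W_OK)
```

Inside a container one would feed it the contents of /proc/self/cgroup
and pass the returned path to subtree_is_writable().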

nspawn itself is actually fine with running inside of nspawn (or at
least it used to be; I haven't tested this in a while).

Note, however, that delegation of resource controllers is not safe on
cgroupv1, and nspawn hence makes all resource controllers (meaning: all
of "cpu", "memory", "blkio", …) read-only. This will become safe with
cgroupv2. Effectively this means that you can set resource limits on
the outermost container, but not on anything inside of it.

>    - cgroups filesystem normally mounted ro in containers, to protect the
>    host (or, something related to privileged containers)

well, it's not that easy. Today, systemd makes all cgroup controller
hierarchies read-only, except for the name=systemd named hierarchy,
where everything above the container's cgroup subtree is read-only,
but the subtree itself writable.
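
To see this from inside a container, one can list the cgroup mounts and
their ro/rw state. The following is a rough sketch that parses text in
the /proc/mounts format (device, mount point, fs type, options, ...);
the function name is hypothetical, and any sample lines fed to it are
illustrative rather than captured from a real container:

```python
def cgroup_mount_modes(mounts_text):
    """Map each cgroup mount point to "ro" or "rw".

    mounts_text is expected in /proc/mounts format. If a mount point
    appears more than once, the last (topmost) entry wins.
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("cgroup", "cgroup2"):
            continue
        options = fields[3].split(",")
        modes[fields[1]] = "ro" if "ro" in options else "rw"
    return modes
```

With the setup described above, the resource controller mounts would be
expected to show up as "ro", and only the container's own subtree of
the name=systemd hierarchy as "rw".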

>       - When mounted rw it can break the host (not the worst problem in the
>       world, we're not defending against malice here, but apparently it's
>       trivial to brick the host by having systemd fight over ttys, etc)

well, if we'd mount all cgroup hierarchies writable, including the
various resource controller hierarchies, and everything above the
container's subtree in the name=systemd hierarchy, then this would be
a major security problem. First of all, controller delegation is not
safe on cgroupv1 (as mentioned above), and secondly this would enable
the container to interfere with the host's cgroup tree, which is
highly problematic.

That said, containers on Linux are not really a security concept
anyway; there are more holes in the entire model than in Swiss
cheese. But we should at least close the holes we are aware of.

>       - it might be fair to say that privileged containers
>    - namespaced cgroups are relatively new in Linux
>       - available 4.6 [1]
>       - backported to 4.4+ on Ubuntu kernels
>    - We think LXD does something around setns() [2] to make sure that the
>    container has a correct view of the cgroup "subtree".

yes, cgroup namespaces are very new. Also, they only make full sense
on cgroupv2 as delegation isn't safe on cgroupv1 anyway.
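
Both conditions can be probed at runtime. A small sketch, assuming that
kernels with cgroup namespace support expose /proc/self/ns/cgroup and
that a cgroupv2 (unified) mount has a cgroup.controllers file at its
root; the function names and the default paths are assumptions made for
illustration:

```python
import os


def has_cgroup_namespaces(proc_root="/proc"):
    """True if the kernel exposes a cgroup namespace entry for us."""
    # lexists() avoids depending on how the namespace link resolves.
    return os.path.lexists(os.path.join(proc_root, "self", "ns", "cgroup"))


def is_unified_hierarchy(cgroup_root="/sys/fs/cgroup"):
    """True if cgroup_root looks like the root of a cgroupv2 mount."""
    return os.path.exists(os.path.join(cgroup_root, "cgroup.controllers"))
```

The root paths are parameters so the checks can be pointed at a
different mount point (or at a test directory) if needed.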

> I suspect something can be done in .nspawn files to grant certain
> privileges to work around issues related to ro/rw cgroups trees, etc but I
> think systemd-nspawn has to know about creating the correct cgroup
> hierarchy before passing control to the

As mentioned, Docker could work just fine inside of an nspawn container
if it wanted to: it won't have access to any controllers, but it gets
enough write access to delegate things further.

Lennart

-- 
Lennart Poettering, Red Hat