[systemd-devel] systemd as a docker process manager

Lennart Poettering lennart at poettering.net
Thu Oct 31 17:41:35 UTC 2019


On Sun, 27.10.19 20:50, Jeff Solomon (jsolomon8080 at gmail.com) wrote:

> This is a followup to this thread:
>
> https://lists.freedesktop.org/archives/systemd-devel/2015-July/033585.html
>
> To see if there are any new developments.
>
> We have a multi-process application that already uses systemd successfully.
> Our customers want to put the application into a container and that
> container should be docker because that is what they use. We can't use
> systemd-nspawn or podman or whatever because our customers want to use
> docker because they are already using docker for other applications.
>
> I understand that containers are not a security technology but we want to
> find a solution that allows us to run systemd in a docker container that
> isn't blatantly less secure than systemd running outside of a container. I
> have yet to find a way.
>
> Fundamentally, the problem is that the systemd in the container requires
> read/write access to the host's /sys/fs/cgroup/systemd directory in order
> to function at all.

It only requires write access to the subtree it lives in, not to what
lives above it. See how nspawn does it.
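
As a rough illustration of "the subtree it lives in", here is a minimal
Python sketch. It assumes the usual /proc/<pid>/cgroup line format
(hierarchy-ID:controller-list:path). The function name and the sample
container ID are made up for the example.

```python
def systemd_cgroup_path(proc_cgroup_text):
    """Return the cgroup path a process lives in, as seen by systemd.

    Prefers the cgroup v1 name=systemd hierarchy and falls back to the
    cgroup v2 unified entry ("0::/path"). Returns None if neither exists.
    """
    unified = None
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)  # the path itself may contain ':'
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        if controllers == "name=systemd":
            return path
        if hierarchy_id == "0" and controllers == "":
            unified = path
    return unified


# Hypothetical sample; on a real system read /proc/self/cgroup instead.
sample_v1 = "12:memory:/docker/mem\n1:name=systemd:/docker/abc123\n"
print(systemd_cgroup_path(sample_v1))
```

Only the path this returns (and what is below it) would need to be
writable for the systemd instance in the container.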

> Even if the container isn't privileged, if it's necessary
> to mount the host's /sys/fs/cgroup directory inside the container and let
> the container write to it, you have a security hole that doesn't exist when
> systemd is just run on the host. That hole is described here:

Three options:

1. Docker should use CLONE_NEWCGROUP to get its own cgroup subtree
   hiding what is outside of it.

2. Docker should mount the root of the cgroup tree read-only, only the
   subtree the container is supposed to live in writable.

3. Just use cgroupsv2.

I don't really know Docker; you'd have to ask them whether they support
any of this. They are a bit behind on these things, but maybe if you
ping them, they will add this for you.

(Of course, systemd-nspawn supports all three of the above.)
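
To check from inside a container which of options 2 and 3 is in effect,
one could inspect the mount table. The following is a minimal Python
sketch, assuming the /proc/<pid>/mountinfo format described in proc(5).
The function name and the sample lines are hypothetical. It does not
detect option 1 (a cgroup namespace).

```python
def cgroup_mounts(mountinfo_text):
    """Return (mount_point, fs_type, writable) for each cgroup mount."""
    result = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # A lone "-" separates the optional fields from the fs type.
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) < sep + 2:
            continue
        mount_point = fields[4]
        options = fields[5].split(",")  # per-mount options, e.g. ro/rw
        fs_type = fields[sep + 1]       # "cgroup" is v1, "cgroup2" is v2
        if fs_type in ("cgroup", "cgroup2"):
            result.append((mount_point, fs_type, "rw" in options))
    return result


# Hypothetical sample: hierarchy root read-only, container subtree writable.
sample = (
    "30 25 0:26 / /sys/fs/cgroup/systemd ro,nosuid shared:9 - "
    "cgroup cgroup rw,xattr,name=systemd\n"
    "31 25 0:26 /docker/abc123 /sys/fs/cgroup/systemd/docker/abc123 "
    "rw,nosuid shared:9 - cgroup cgroup rw,xattr,name=systemd\n"
    "22 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw\n"
)
for mount in cgroup_mounts(sample):
    print(mount)
```

A "cgroup2" entry would indicate option 3; a read-only root with a
writable subtree below it would correspond to option 2.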

> https://blog.trailofbits.com/2019/07/19/understanding-docker-container-escapes/
>
> Using user namespaces doesn't help because then the container user wouldn't
> have permission to write to the /sys/fs/cgroup/systemd.

It doesn't need write access to that dir, only to the subtree it is
supposed to live in.

> Our application runs as a non-root user. The security concern is that any
> user on the host who is in the docker group would be able to start a shell
> inside the container as "container root" and then be able to get root on
> the host. So basically membership in the docker group is equivalent to host
> root.
>
> Taking a step back - I wonder (mostly asking Lennart) if there is a way to
> run systemd without it needing access to /sys/fs/cgroup/systemd? I'm sure
> there isn't but I thought I would ask.

No. systemd requires cgroups. But it's fine to mount only the subtree
it needs writable. systemd carefully makes sure that the service
manager never steps beyond its territory, so the access boundaries are
clear. That allows you to arrange the cgroup tree so that only the
subtree and the hierarchy systemd really needs (i.e. the name=systemd
hierarchy) are writable.
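
The access boundary described above can be written down as a small
rule. This is only a sketch of the intended layout for cgroup v1, not
something systemd or Docker provides; the function name and the example
paths are invented for illustration.

```python
import posixpath


def should_be_writable(hierarchy, cgroup_path, container_cgroup):
    """Decide whether a cgroup directory should be writable for the container.

    Writable only if it is in the name=systemd hierarchy and at or below
    the cgroup the container was placed in. Paths must be absolute.
    """
    if hierarchy != "name=systemd":
        return False
    subtree = posixpath.normpath(container_cgroup)
    target = posixpath.normpath(cgroup_path)  # collapses ".." components
    # commonpath compares whole path components, so "/docker/abc1234"
    # is not treated as being inside "/docker/abc123".
    return posixpath.commonpath([subtree, target]) == subtree


# Hypothetical container cgroup.
container = "/docker/abc123"
print(should_be_writable("name=systemd", "/docker/abc123/system.slice", container))
print(should_be_writable("name=systemd", "/docker", container))
print(should_be_writable("memory", "/docker/abc123", container))
```

Everything for which this returns False, including the root of the
name=systemd hierarchy, would be mounted read-only or hidden.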

(I mean, cgroupsv1 and non-userns containers are not safe anyway, so
you are just closing one gaping hole while leaving many others open,
but of course, this is your choice).

> Is there a way to run systemd's user service without it having the system
> systemd service as a parent?

This is not supported, sorry.

Lennart

--
Lennart Poettering, Berlin

