[systemd-devel] I want to run systemd inside of a locked down base docker container

Wed Feb 10 22:27:49 CET 2016

On Wed, 10.02.16 15:58, Daniel J Walsh (dwalsh at redhat.com) wrote:
> >>>>     sed -i 's/^enable/disable/g' /lib/systemd/system-preset/* 
> >>> Why would this matter?
> >> We don't want excess services running inside of a docker container.  I
> >> only want systemd/journald and any services
> >> that I enable in the container.   Not something pulled in because the
> >> installer thinks this is a VM or a Host OS.
> > Well, the default preset policy in Fedora is to disable everything by
> > default, modulo a few exceptions. Hence it should be unnecessary to
> > change anything with the default preset policy, unless you actually
> > want to *enable* rather than disable more by default...
>
> Here is what I see enabled in the base container.  I don't think we
> want any of this stuff running by default in a docker container.

[…]

Well, but pretty much all the units you listed here are units from
RPMs you wouldn't install in a container anyway, aren't they? Thus,
they shouldn't matter anyway, and I'd argue they should be enabled by
default in a container too – if they are installed explicitly by the
user, through RPM. Hence, I think patching the preset stuff is not
necessary at all.

> > I don't see why one would want to mask systemd-logind.service. If you
> > permit logins and PAM at all, you really need that. 
>
> If I wanted to add a login program I could enable/unmask these.
> No one runs docker containers as login services, that would require
> getty. 

Well, "machinectl shell", "cron" and all those things do PAM... In
fact, that "machinectl shell" goes through PAM and registers with
logind that way is one of its major benefits over naked "nsenter".

I can see that you don't want to run it by default, but maybe we can
rearrange things so that logind is started on first use (i.e. on the
first PAM conversation). That way logind would normally not run in a
container, until it is actually requested by a PAM conversation. We
could even add exit-on-idle so that it goes away after a while when
the user logs out again.

That way logind could stay available but would normally not appear in
"ps" unless it is actually used.

I added this to the TODO list now.

> > And masking the getty stuff appears to be entirely unnecessary...
> Again the goal is just to get rid of the getty failure message at
> bootup.

But there should really be none with current systemd, as you don't
have /dev/tty0 and the getty unit has ConditionPathExists=/dev/tty0. 
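
To check this by hand inside the container, a rough shell sketch along
these lines mimics what ConditionPathExists= does. The function name is
made up for illustration and is not a systemd interface:

```shell
# Hypothetical helper, not part of systemd: mimics ConditionPathExists=,
# including the "!" prefix that negates the check.
condition_path_exists() {
    case $1 in
        '!'*) [ ! -e "${1#?}" ] ;;   # strip the leading "!" and invert
        *)    [ -e "$1" ] ;;
    esac
}

if condition_path_exists /dev/tty0; then
    echo "/dev/tty0 exists, the getty condition holds"
else
    echo "/dev/tty0 is missing, the getty condition fails"
fi
```

If this reports the path as missing and you still get a getty message,
then the message comes from somewhere other than that condition.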

What precisely does the getty message that you get look like?

> > Which leaves the /dev/hugepages and /sys/fs/fuse/connections
> > mounts. Not sure about those. Are you running the container with
> > CAP_SYS_ADMIN? If so, then there's no reason to mask those units. If
> > not, then I figure we could add checks that these are conditioned out
> > if CAP_SYS_ADMIN is missing.
>
> No, docker containers do not enable SYS_ADMIN or NET_ADMIN by
> default.

I'll add a ConditionCapability=CAP_SYS_ADMIN line to the fuse
mount. The hugepages mount already has one (since 218).

With that addition there should really be no reason to mask out either
of the units explicitly; systemd should already silently skip them in
a docker setup where CAP_SYS_ADMIN is missing.
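
As far as I know, ConditionCapability= looks at the capability bounding
set of the service manager, so you can approximate the check from a
shell in the container. This is only a sketch: the helper name is made
up, and reading CapBnd from /proc/1/status assumes that PID 1 there is
the systemd instance in question:

```shell
# Hypothetical helper, not part of systemd: succeeds if capability
# number $2 is set in the hexadecimal capability mask $1.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_SYS_ADMIN=21   # capability number from <linux/capability.h>

# Assumption: PID 1 is the service manager whose bounding set matters.
bnd=$(awk '/^CapBnd:/ { print $2 }' /proc/1/status 2>/dev/null)

if [ -n "$bnd" ] && has_cap "$bnd" "$CAP_SYS_ADMIN"; then
    echo "CAP_SYS_ADMIN is in the bounding set"
else
    echo "CAP_SYS_ADMIN not found in the bounding set"
fi
```

When the second message is printed, units carrying
ConditionCapability=CAP_SYS_ADMIN should be skipped rather than fail.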

> > On nspawn these two aren't seen since nspawn actually doesn't mount
> > the real sysfs to /sys, but just a tmpfs with a select number of
> > subdirectories from the real sysfs for security reasons. One of the
> > subdirs that are suppressed is /sys/fs. Now,
> > sys-fs-fuse-connections.mount is conditionalized on
> > /sys/fs/fuse/connections existing, hence if it is not there, then it
> > won't be mounted. And /dev/hugepages we simply allow to be mounted in
> > the container.
>
> Interesting idea.  Maybe we should just mount over /sys/fs also.

Well, note that we over-mount /sys with a tmpfs, and then some parts
of the real /sys into that. /sys/fs hence is just a subdir of our
private tmpfs. The tmpfs is marked r/o after everything is set up.

> Do you just mount hugepages then during container setup?

No. In nspawn, when we pass CAP_SYS_ADMIN to the container, the
container will just mount /dev/hugepages correctly on its own. And if
we do drop CAP_SYS_ADMIN, then the ConditionCapability=CAP_SYS_ADMIN in
the unit file mentioned above will already result in the mount being
skipped silently.

Lennart

-- 
Lennart Poettering, Red Hat