[systemd-devel] I want to run systemd inside of a locked down base docker container

Lennart Poettering lennart at poettering.net
Wed Feb 10 19:14:58 CET 2016


On Wed, 10.02.16 11:36, Daniel J Walsh (dwalsh at redhat.com) wrote:

> >>     systemctl mask systemd-firstboot initrd-udevadm-cleanup-db.service
> >> systemd-udev-settle.service systemd-udev-trigger.service
> >> systemd-udevd.service systemd-udevd-control.socket
> >> systemd-udevd-kernel.socket; \
> > The systemd-firstboot service should have no effect unless you
> > actually boot with an empty /etc (or more accurately: unless you
> > actually boot with an /etc that lacks /etc/machine-id). Note that it
> > carries a condition ConditionFirstBoot=yes which makes sure that it
> > isn't even executed in normal cases. 
> I see in the logs systemd complaining about no systemd-firstboot
> command.

Well, what have you installed in the container? Is the
systemd-firstboot binary there? If not, why not? If this has been
split out of the core package, then the service unit for it should
have been split out too, hence there shouldn't be any error about this.
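
A quick way to check for such a packaging mismatch is sketched below.
It is only an illustration: the function name check_firstboot is
hypothetical, and both paths are assumptions about a Fedora-style /usr
layout, so adjust them for your image.

```shell
# check_firstboot [ROOT]
# Reports whether the systemd-firstboot binary and its unit file are
# both present or both absent under the image root ROOT (default: /).
# Both paths are assumptions about a Fedora-style /usr layout.
check_firstboot() {
    root=${1:-}
    root=${root%/}
    bin="$root/usr/bin/systemd-firstboot"
    unit="$root/usr/lib/systemd/system/systemd-firstboot.service"

    if [ -e "$bin" ] && [ -e "$unit" ]; then
        echo "consistent: binary and unit both present"
    elif [ ! -e "$bin" ] && [ ! -e "$unit" ]; then
        echo "consistent: binary and unit both absent"
    elif [ -e "$unit" ]; then
        echo "mismatch: unit present, binary missing"
        return 1
    else
        echo "mismatch: binary present, unit missing"
        return 1
    fi
}
```

Run it as "check_firstboot /" inside the container, or point it at an
unpacked image root.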

> > Masking all the udev stuff is pretty pointless too. These services are
> > conditioned out in containers too anyway. There's really no need to
> > mask them out. More specifically, they contain
> > ConditionPathIsReadWrite=/sys, i.e. are skipped if /sys is read-only,
> > which is how container managers should set up /sys (it's a big
> > security hole to allow containers write access to /sys). My
> > recommendation would be to make sure your container manager implements
> > these recommendations:
> I am just seeing mentions of udev inside of the container. What I don't
> want is messages
> inside of the journal or bootup that look like systemd is trying to run
> firstboot, udev etc.

Sure, that's precisely what the ConditionXYZ= constructs are for: to
skip stuff silently that is not necessary in some cases. 

And by default systemd comes with all the conditions in place, so
that a vanilla systemd image should work fine in any container manager
that implements the container interface.
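
To see which conditions a given unit carries, something like the
following minimal sketch can help. The helper name unit_conditions is
made up for illustration; it simply greps the unit file and does not
evaluate anything.

```shell
# unit_conditions FILE
# Prints the Condition*= and Assert*= lines of a unit file, so it is
# easy to see which checks can make systemd skip the unit.
unit_conditions() {
    grep -E '^(Condition|Assert)[A-Za-z]+=' "$1"
}
```

For example, "unit_conditions /usr/lib/systemd/system/systemd-udevd.service"
(path assumed) should list the ConditionPathIsReadWrite=/sys line. On a
booted system, "systemctl show -p ConditionResult systemd-udevd.service"
tells you whether the conditions actually held.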

> > https://wiki.freedesktop.org/www/Software/systemd/ContainerInterface/
> >
> > If your container manager follows these guidelines (of which the /sys
> > being read-only thing is one), then there should be no special hacks
> > necessary in systemd, as it should just work, and detect the slight
> > semantic changes of containers correctly and avoid them cleanly.
> >
> >>     rm -f /lib/systemd/system/multi-user.target.wants/systemd*
> >>     /lib/systemd/system/multi-user.target.wants/getty*;\
> > What's the rationale for this? First of all, the getty stuff appears
> > entirely unnecessary as getty@.service (which is the only thing
> > generally linked from getty.target these days) contains
> > ConditionPathExists=/dev/tty0 which means it's already skipped when
> > run on systems lacking a VC (such as containers).
> Again, I am seeing getty@ failures inside of the container.

That would suggest that there's a /dev/tty0 in the container? That
looks really wrong... A container has no virtual console hence there
should be no /dev/tty0.

On Linux /dev/tty0 is a special device node that is part of the
kernel's VC subsystem, and points to the VC currently in the
foreground. It has no place in virtualized systems such as containers.
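
A trivial check for this, as a sketch (the function name
has_virtual_console is hypothetical), tests the same path that
getty@.service looks at:

```shell
# has_virtual_console [DEVDIR]
# Succeeds if DEVDIR/tty0 exists, which is the path getty@.service
# tests with ConditionPathExists=/dev/tty0. DEVDIR defaults to /dev.
has_virtual_console() {
    [ -e "${1:-/dev}/tty0" ]
}
```

Inside the container, "has_virtual_console && echo 'tty0 is visible'"
printing anything would explain why the gettys are being started.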

What is docker mounting as /dev into the container? Does it just bind
mount the host /dev? That's really nasty, as that will expose host
devices and device node ownership to the containers. They really
shouldn't do that and instead mount their own tmpfs to /dev and just
create the device nodes for /dev/null, /dev/random and so on, but
nothing else.
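
Roughly, a private /dev could be populated along the following lines.
This is a dry-run sketch, not what any container manager actually
does: the function name is hypothetical, it only prints the commands,
the mount options and node modes are assumptions, and the exact set of
nodes beyond /dev/null and /dev/random is my guess at "and so on". The
major/minor numbers are the standard Linux ones.

```shell
# print_minimal_dev_plan TARGET
# Prints (does not run) commands that would populate TARGET/dev with a
# private tmpfs and a few basic character device nodes.
print_minimal_dev_plan() {
    dev="${1%/}/dev"
    echo "mount -t tmpfs -o mode=755,nosuid tmpfs $dev"
    # Each line below is: name major minor
    while read -r name major minor; do
        echo "mknod -m 666 $dev/$name c $major $minor"
    done <<EOF
null 1 3
zero 1 5
full 1 7
random 1 8
urandom 1 9
tty 5 0
EOF
}
```

Calling "print_minimal_dev_plan /var/lib/machines/foo" (an example
path) lets you review the commands before running anything as root.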

> > And the other services you are removing here: what's the point? they
> > aren't really optional, that's why they are linked from /usr/lib,
> > rather than subject to systemctl enable/disable...
> >
> >>     sed -i 's/^enable/disable/g' /lib/systemd/system-preset/* 
> > Why would this matter?
> We don't want excess services running inside of a docker container.  I
> only want systemd/journald and any services
> that I enable in the container.   Not something pulled in because the
> installer thinks this is a VM or a Host OS.

Well, the default preset policy in Fedora is to disable everything by
default, modulo a few exceptions. Hence it should be unnecessary to
change anything with the default preset policy, unless you actually
want to *enable* rather than disable more by default...
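
To verify what the preset policy in an image really enables, a small
sketch like this lists the exceptions. The helper name preset_enables
is made up, and it only looks at one directory at a time.

```shell
# preset_enables DIR
# Prints every "enable ..." directive found in the *.preset files of
# DIR. Prints nothing if there are none.
preset_enables() {
    cat "$1"/*.preset 2>/dev/null | grep -E '^[[:space:]]*enable[[:space:]]'
}
```

For instance, "preset_enables /lib/systemd/system-preset" shows the
units that the sed command quoted above would flip to disabled.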

> Set hostname to <ba64338e2b1a>.
> Running in a container, ignoring fstab device entry for /dev/disk/by-uuid/2cd63037-e967-4e87-b29b-044190721e80.
> sys-fs-fuse-connections.mount: Cannot add dependency job, ignoring: Unit sys-fs-fuse-connections.mount is masked.
> dev-hugepages.mount: Cannot add dependency job, ignoring: Unit dev-hugepages.mount is masked.
> systemd-remount-fs.service: Cannot add dependency job, ignoring: Unit systemd-remount-fs.service is masked.
> systemd-logind.service: Cannot add dependency job, ignoring: Unit systemd-logind.service is masked.
> getty.target: Cannot add dependency job, ignoring: Unit getty.target is masked.
> [OK ] Reached target Encrypted Volumes.
> [OK ] Created slice Root Slice.
> [OK ] Listening on Journal Socket.
> [OK ] Listening on Journal Socket (/dev/log).
> [OK ] Reached target Remote File Systems.
> [OK ] Reached target Paths.
> [OK ] Created slice System Slice.
> ...
> 
> I want to get rid of these mount messages, getty messages systemd-logind messages...

The remount-fs.service is a no-op anyway, unless you actually ship stuff
in /etc/fstab, which you shouldn't. Also, you reference a physical
hard disk from /etc/fstab, which makes no sense in a container
either. I'd really recommend removing /etc/fstab entirely.
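
If you want to check an image for leftover fstab entries before
deleting the file, a sketch such as the following would do (the
function name fstab_entries is hypothetical):

```shell
# fstab_entries [FILE]
# Prints the active (non-blank, non-comment) lines of an fstab file.
# FILE defaults to /etc/fstab; a missing file counts as no entries.
fstab_entries() {
    file=${1:-/etc/fstab}
    [ -e "$file" ] || return 0
    grep -Ev '^[[:space:]]*(#|$)' "$file"
}
```

Any output here, such as the /dev/disk/by-uuid line from your log, is
something the container has no use for.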

I don't see why one would want to mask systemd-logind.service. If you
permit logins and PAM at all, you really need that. 

And masking the getty stuff appears to be entirely unnecessary...

Which leaves the /dev/hugepages and /sys/fs/fuse/connections
mounts. Not sure about those. Are you running the container with
CAP_SYS_ADMIN? If so, then there's no reason to mask those units. If
not, then I figure we could add checks that these are conditioned out
if CAP_SYS_ADMIN is missing.
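
To find out from inside the container, you can look at the effective
capability mask. A minimal sketch, assuming /proc is mounted and using
a made-up function name (CAP_SYS_ADMIN is capability number 21):

```shell
# has_cap_sys_admin [HEXMASK]
# Succeeds if bit 21 (CAP_SYS_ADMIN) is set in a hexadecimal
# capability mask. Without an argument, the mask is read from the
# CapEff line of /proc/self/status.
has_cap_sys_admin() {
    mask=${1:-$(awk '/^CapEff:/ { print $2 }' /proc/self/status)}
    [ $(( (0x$mask >> 21) & 1 )) -eq 1 ]
}
```

Run "has_cap_sys_admin && echo yes || echo no" in the container's
shell. Note this reports the calling shell's capabilities, which should
normally match PID 1's unless something dropped them in between.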

On nspawn these two aren't seen since nspawn actually doesn't mount
the real sysfs to /sys, but just a tmpfs with a select number of
subdirectories from the real sysfs for security reasons. One of the
subdirs that are suppressed is /sys/fs. Now,
sys-fs-fuse-connections.mount is conditionalized on
/sys/fs/fuse/connections existing, hence if it is not there, then it
won't be mounted. And /dev/hugepages we simply allow to be mounted in
the container.

Lennart

-- 
Lennart Poettering, Red Hat

