[systemd-devel] I want to run systemd inside of a locked down base docker container

Daniel J Walsh dwalsh at redhat.com
Wed Feb 10 21:58:23 CET 2016



On 02/10/2016 01:14 PM, Lennart Poettering wrote:
> On Wed, 10.02.16 11:36, Daniel J Walsh (dwalsh at redhat.com) wrote:
>
>>>>     systemctl mask systemd-firstboot initrd-udevadm-cleanup-db.service
>>>> systemd-udev-settle.service systemd-udev-trigger.service
>>>> systemd-udevd.service systemd-udevd-control.socket
>>>> systemd-udevd-kernel.socket; \
>>> The systemd-firstboot service should have no effect unless you
>>> actually boot with an empty /etc (or more accurately: unless you
>>> actually boot with an /etc that lacks /etc/machine-id). Note that it
>>> carries a condition ConditionFirstBoot=yes which makes sure that it
>>> isn't even executed in normal cases. 
>> I see in the logs systemd complaining about no systemd-firstboot
>> command.
> Well, what have you installed in the container? Is the
> systemd-firstboot binary there? If not, why not? If this has been
> split out of the core package, then the service unit for it should
> have been split out too, hence there shouldn't be any error about this.
>
>>> Masking all the udev stuff is pretty pointless too. These services are
>>> conditioned out in containers too anyway. There's really no need to
>>> mask them out. More specifically, they contain
>>> ConditionPathIsReadWrite=/sys, i.e. are skipped if /sys is read-only,
>>> which is how container managers should set up /sys (it's a big
>>> security hole to allow containers write access to /sys). My
>>> recommendation would be to make sure your container manager implements
>>> these recommendations:
>> I am just seeing mentions of udev inside of the container. What I don't
>> want is messages inside of the journal or bootup that look like systemd
>> is trying to run firstboot, udev etc.
> Sure, that's precisely what the ConditionXYZ= constructs are for: to
> skip stuff silently that is not necessary in some cases. 
>
> And by default systemd comes with all the conditions in place so
> that a vanilla systemd image should work fine in a container manager
> that implements the container interface.
>
>>> https://wiki.freedesktop.org/www/Software/systemd/ContainerInterface/
>>>
>>> If your container manager follows these guidelines (of which the /sys
>>> being read-only thing is one), then there should be no special hacks
>>> necessary in systemd, as it should just work, and detect the slight
>>> semantic changes of containers correctly and avoid them cleanly.
>>>
>>>>     rm -f /lib/systemd/system/multi-user.target.wants/systemd*
>>>>     /lib/systemd/system/multi-user.target.wants/getty*;\
>>> What's the rationale for this? First of all, the getty stuff appears
>>> entirely unnecessary as getty@.service (which is the only thing
>>> generally linked from getty.target these days) contains
>>> ConditionPathExists=/dev/tty0 which means it's already skipped when
>>> run on systems lacking a VC (such as containers).
>> Again, I am seeing getty@ failures inside of the container.
> That would suggest that there's a /dev/tty0 in the container? That
> looks really wrong... A container has no virtual console hence there
> should be no /dev/tty0.
> On Linux /dev/tty0 is a special device node that is part of the
> kernel's VC subsystem, and points to the VC currently in the
> foreground. It has no place in virtualized systems such as containers.
> What is docker mounting as /dev into the container? Does it just bind
> mount the host /dev? That's really nasty, as that will expose host
> devices and device node ownership to the containers. They really
> shouldn't do that and instead mount their own tmpfs to /dev and just
> create the device nodes for /dev/null, /dev/random and so on, but
> nothing else.
They don't; they create their own /dev inside the container with
locked-down devices.

ls -lZ /dev/
total 0
crw-------. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706 136,   3 Feb 10 20:42 console
lrwxrwxrwx. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706       13 Feb 10 20:42 fd -> /proc/self/fd
crw-rw-rw-. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706   1,   7 Feb 10 20:42 full
c---------. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706  10, 229 Feb 10 20:42 fuse
lrwxrwxrwx. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706       11 Feb 10 20:42 kcore -> /proc/kcore
drwxrwxrwt. 2 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706       40 Feb 10 20:42 mqueue
crw-rw-rw-. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706   1,   3 Feb 10 20:42 null
lrwxrwxrwx. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706        8 Feb 10 20:42 ptmx -> pts/ptmx
drwxr-xr-x. 2 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706        0 Feb 10 20:42 pts
crw-rw-rw-. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706   1,   8 Feb 10 20:42 random
drwxrwxrwt. 2 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706       40 Feb 10 20:42 shm
lrwxrwxrwx. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706       15 Feb 10 20:42 stderr -> /proc/self/fd/2
lrwxrwxrwx. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706       15 Feb 10 20:42 stdin -> /proc/self/fd/0
lrwxrwxrwx. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706       15 Feb 10 20:42 stdout -> /proc/self/fd/1
crw-rw-rw-. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706   5,   0 Feb 10 20:42 tty
crw-rw-rw-. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706   1,   9 Feb 10 20:42 urandom
crw-rw-rw-. 1 root root system_u:object_r:svirt_sandbox_file_t:s0:c15,c706   1,   5 Feb 10 20:42 zero


This is what a standard /dev looks like in a container.
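
Note that the listing above contains no tty0 entry. As a rough sketch of
the check that ConditionPathExists=/dev/tty0 amounts to, something like the
following can be run inside the container. The helper name has_vc and its
optional directory argument are made up for illustration; this is not
systemd's own code.

```shell
#!/bin/sh
# has_vc [DEVDIR]: succeed if DEVDIR contains a tty0 entry, i.e. the path
# that getty@.service tests with ConditionPathExists=/dev/tty0.
# DEVDIR defaults to /dev; the argument is a hypothetical convenience so
# the check can be pointed at another directory.
has_vc() {
    [ -e "${1:-/dev}/tty0" ]
}

if has_vc; then
    echo "/dev/tty0 present: the getty@.service path condition would pass"
else
    echo "/dev/tty0 absent: the getty@.service path condition would fail"
fi
```

If this reports tty0 as absent and getty@ still shows failures, the
failure presumably comes from somewhere other than that condition.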



>>> And the other services you are removing here: what's the point? they
>>> aren't really optional, that's why they are linked from /usr/lib,
>>> rather than subject to systemctl enable/disable...
>>>
>>>>     sed -i 's/^enable/disable/g' /lib/systemd/system-preset/* 
>>> Why would this matter?
>> We don't want excess services running inside of a docker container.  I
>> only want systemd/journald and any services
>> that I enable in the container.   Not something pulled in because the
>> installer thinks this is a VM or a Host OS.
> Well, the default preset policy in Fedora is to disable everything by
> default, modulo a few exceptions. Hence it should be unnecessary to
> change anything with the default preset policy, unless you actually
> want to *enable* rather than disable more by default...
Here is what I see enabled in the base container.  I don't think we want
any of this stuff running by default in a docker container.

grep ^enable /lib/systemd/system-preset/*
/lib/systemd/system-preset/85-display-manager.preset:enable gdm.service
/lib/systemd/system-preset/85-display-manager.preset:enable lightdm.service
/lib/systemd/system-preset/85-display-manager.preset:enable slim.service
/lib/systemd/system-preset/85-display-manager.preset:enable lxdm.service
/lib/systemd/system-preset/85-display-manager.preset:enable sddm.service
/lib/systemd/system-preset/85-display-manager.preset:enable kdm.service
/lib/systemd/system-preset/85-display-manager.preset:enable xdm.service
/lib/systemd/system-preset/90-default.preset:enable sshd.service
/lib/systemd/system-preset/90-default.preset:enable atd.*
/lib/systemd/system-preset/90-default.preset:enable crond.*
/lib/systemd/system-preset/90-default.preset:enable chronyd.service
/lib/systemd/system-preset/90-default.preset:enable NetworkManager.service
/lib/systemd/system-preset/90-default.preset:enable NetworkManager-dispatcher.service
/lib/systemd/system-preset/90-default.preset:enable ModemManager.service
/lib/systemd/system-preset/90-default.preset:enable auditd.service
/lib/systemd/system-preset/90-default.preset:enable restorecond.service
/lib/systemd/system-preset/90-default.preset:enable bluetooth.*
/lib/systemd/system-preset/90-default.preset:enable avahi-daemon.*
/lib/systemd/system-preset/90-default.preset:enable cups.*
/lib/systemd/system-preset/90-default.preset:enable rsyslog.*
/lib/systemd/system-preset/90-default.preset:enable syslog-ng.*
/lib/systemd/system-preset/90-default.preset:enable sysklogd.*
/lib/systemd/system-preset/90-default.preset:enable firewalld.service
/lib/systemd/system-preset/90-default.preset:enable libvirtd.service
/lib/systemd/system-preset/90-default.preset:enable xinetd.service
/lib/systemd/system-preset/90-default.preset:enable multipathd.service
/lib/systemd/system-preset/90-default.preset:enable libstoragemgmt.service
/lib/systemd/system-preset/90-default.preset:enable lvm2-monitor.*
/lib/systemd/system-preset/90-default.preset:enable lvm2-lvmetad.*
/lib/systemd/system-preset/90-default.preset:enable dm-event.*
/lib/systemd/system-preset/90-default.preset:enable dmraid-activation.service
/lib/systemd/system-preset/90-default.preset:enable mdmonitor.service
/lib/systemd/system-preset/90-default.preset:enable mdmonitor-takeover.service
/lib/systemd/system-preset/90-default.preset:enable spice-vdagentd.service
/lib/systemd/system-preset/90-default.preset:enable qemu-guest-agent.service
/lib/systemd/system-preset/90-default.preset:enable dnf-makecache.timer
/lib/systemd/system-preset/90-default.preset:enable vmtoolsd.service
/lib/systemd/system-preset/90-default.preset:enable dkms.service
/lib/systemd/system-preset/90-default.preset:enable ipmi.service
/lib/systemd/system-preset/90-default.preset:enable ipmievd.service
/lib/systemd/system-preset/90-default.preset:enable x509watch.timer
/lib/systemd/system-preset/90-default.preset:enable dnssec-triggerd.service
/lib/systemd/system-preset/90-default.preset:enable uuidd.socket
/lib/systemd/system-preset/90-default.preset:enable gpm.*
/lib/systemd/system-preset/90-default.preset:enable gpsd.socket
/lib/systemd/system-preset/90-default.preset:enable x2gocleansessions.service
/lib/systemd/system-preset/90-default.preset:enable unbound-anchor.timer
/lib/systemd/system-preset/90-default.preset:enable lvm2-lvmpolld.*
/lib/systemd/system-preset/90-default.preset:enable dbxtool.service
/lib/systemd/system-preset/90-default.preset:enable irqbalance.service
/lib/systemd/system-preset/90-default.preset:enable lm_sensors.service
/lib/systemd/system-preset/90-default.preset:enable mcelog.*
/lib/systemd/system-preset/90-default.preset:enable smartd.service
/lib/systemd/system-preset/90-default.preset:enable pcscd.socket
/lib/systemd/system-preset/90-default.preset:enable rngd.service
/lib/systemd/system-preset/90-default.preset:enable abrtd.service
/lib/systemd/system-preset/90-default.preset:enable abrt-ccpp.service
/lib/systemd/system-preset/90-default.preset:enable abrt-oops.service
/lib/systemd/system-preset/90-default.preset:enable abrt-xorg.service
/lib/systemd/system-preset/90-default.preset:enable abrt-vmcore.service
/lib/systemd/system-preset/90-default.preset:enable ksm.service
/lib/systemd/system-preset/90-default.preset:enable ksmtuned.service
/lib/systemd/system-preset/90-default.preset:enable rootfs-resize.service
/lib/systemd/system-preset/90-default.preset:enable sysstat.service
/lib/systemd/system-preset/90-default.preset:enable sysstat-collect.timer
/lib/systemd/system-preset/90-default.preset:enable sysstat-summary.timer
/lib/systemd/system-preset/90-default.preset:enable uuidd.service
/lib/systemd/system-preset/90-default.preset:enable xendomains.service
/lib/systemd/system-preset/90-default.preset:enable xenstored.service
/lib/systemd/system-preset/90-default.preset:enable xenconsoled.service
/lib/systemd/system-preset/90-default.preset:enable accounts-daemon.service
/lib/systemd/system-preset/90-default.preset:enable rtkit-daemon.service
/lib/systemd/system-preset/90-default.preset:enable upower.service
/lib/systemd/system-preset/90-default.preset:enable udisks2.service
/lib/systemd/system-preset/90-default.preset:enable polkit.service
/lib/systemd/system-preset/90-default.preset:enable timedatex.service
/lib/systemd/system-preset/90-default.preset:enable mlocate-updatedb.timer
/lib/systemd/system-preset/90-default.preset:enable sa-update.timer
/lib/systemd/system-preset/90-systemd.preset:enable remote-fs.target
/lib/systemd/system-preset/90-systemd.preset:enable machines.target
/lib/systemd/system-preset/90-systemd.preset:enable getty@.service
/lib/systemd/system-preset/90-systemd.preset:enable systemd-timesyncd.service
/lib/systemd/system-preset/90-systemd.preset:enable systemd-networkd.service
/lib/systemd/system-preset/90-systemd.preset:enable systemd-resolved.service
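
For anyone who wants the same overview in a form that is easier to
post-process, here is a minimal read-only sketch. The function name
preset_enables and the PRESET_DIR variable are hypothetical; it only reads
the "enable" rules from the preset files and changes nothing.

```shell
#!/bin/sh
# preset_enables DIR: print "file<TAB>unit-pattern" for every "enable"
# rule found in DIR/*.preset. Comment lines and "disable" rules are skipped.
preset_enables() {
    for f in "$1"/*.preset; do
        [ -f "$f" ] || continue    # glob did not match anything
        awk -v name="${f##*/}" \
            '$1 == "enable" && NF >= 2 { print name "\t" $2 }' "$f"
    done
}

# PRESET_DIR is an assumed override; the default is the directory
# used in the grep above.
preset_enables "${PRESET_DIR:-/lib/systemd/system-preset}"
```

This lists the preset rules only; it does not say which units are actually
enabled or installed in the image.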

>> Set hostname to <ba64338e2b1a>.
>> Running in a container, ignoring fstab device entry for /dev/disk/by-uuid/2cd63037-e967-4e87-b29b-044190721e80.
>> sys-fs-fuse-connections.mount: Cannot add dependency job, ignoring: Unit sys-fs-fuse-connections.mount is masked.
>> dev-hugepages.mount: Cannot add dependency job, ignoring: Unit dev-hugepages.mount is masked.
>> systemd-remount-fs.service: Cannot add dependency job, ignoring: Unit systemd-remount-fs.service is masked.
>> systemd-logind.service: Cannot add dependency job, ignoring: Unit systemd-logind.service is masked.
>> getty.target: Cannot add dependency job, ignoring: Unit getty.target is masked.
>> [OK ] Reached target Encrypted Volumes.
>> [OK ] Created slice Root Slice.
>> [OK ] Listening on Journal Socket.
>> [OK ] Listening on Journal Socket (/dev/log).
>> [OK ] Reached target Remote File Systems.
>> [OK ] Reached target Paths.
>> [OK ] Created slice System Slice.
>> ...
>>
>> I want to get rid of these mount messages, getty messages, systemd-logind messages...
> The remount-fs.service is a nop anyway, unless you actually ship stuff
> in /etc/fstab, which you shouldn't. Also, you reference a physical
> hard disk from /etc/fstab, which makes no sense either in a
> container. I'd really recommend to remove /etc/fstab entirely.
I will try removing /etc/fstab to see if this makes it shut up.
> I don't see why one would want to mask systemd-logind.service. If you
> permit logins and PAM at all, you really need that. 
If I wanted to add a login program I could enable/unmask these.
No one runs docker containers as login services; that would require getty.
> And masking the getty stuff appears to be entirely unnecessary...
Again, the goal is just to get rid of the getty failure message at bootup.
> Which leaves the /dev/hugepages and /sys/fs/fuse/connections
> mounts. Not sure about those. Are you running the container with
> CAP_SYS_ADMIN? If so, then there's no reason to mask those units. If
> not, then I figure we could add checks that these are conditioned out
> if CAP_SYS_ADMIN is missing.
No, docker containers do not enable SYS_ADMIN or NET_ADMIN by default.
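
To verify this from inside a given container, one option is to decode the
CapEff mask in /proc/self/status. The following is only a sketch: the
helper name has_cap is made up, and the bit numbers (12 for CAP_NET_ADMIN,
21 for CAP_SYS_ADMIN) are the values from linux/capability.h.

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if bit number BIT is set in the hexadecimal
# capability mask MASK (written without a 0x prefix, as in /proc).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

CAP_NET_ADMIN=12    # bit numbers as defined in linux/capability.h
CAP_SYS_ADMIN=21

# CapEff is the effective capability set of the calling process.
mask=$(awk '/^CapEff:/ { print $2 }' /proc/self/status 2>/dev/null)
if [ -n "$mask" ]; then
    if has_cap "$mask" "$CAP_SYS_ADMIN"; then
        echo "CAP_SYS_ADMIN: yes"
    else
        echo "CAP_SYS_ADMIN: no"
    fi
    if has_cap "$mask" "$CAP_NET_ADMIN"; then
        echo "CAP_NET_ADMIN: yes"
    else
        echo "CAP_NET_ADMIN: no"
    fi
fi
```

For example, has_cap 00000000a80425fb 21 fails, because bit 21 is clear in
that mask. That value is one I believe is typical for a default docker
container, but treat it as an assumption; it is not taken from this thread.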
> On nspawn these two aren't seen since nspawn actually doesn't mount
> the real sysfs to /sys, but just a tmpfs with a select number of
> subdirectories from the real sysfs for security reasons. One of the
> subdirs that are suppressed is /sys/fs. Now,
> sys-fs-fuse-connections.mount is conditionalized on
> /sys/fs/fuse/connections existing, hence if it is not there, then it
> won't be mounted. And /dev/hugepages we simply allow to be mounted in
> the container.
Interesting idea.  Maybe we should just mount over /sys/fs also.

Do you just mount hugepages then during container setup?
> Lennart
>


