[systemd-devel] logind vs CAP_SYS_ADMIN-lessness

Christian Seiler christian at iwakd.de
Fri Jan 23 10:35:08 PST 2015

On 23.01.2015 at 18:57, Lennart Poettering wrote:
>> On 2015-01-23 08:29, Mantas Mikulėnas wrote:
>>> IIRC, the reason for tmpfs on /run/user/* was lack of tmpfs quotas...
>>> if that's still a problem, maybe there could be one tmpfs at /run/user,
>>> still preventing users from touching root-only /run?
>> Yes, that's a good idea. Initially when posting this thread I thought
>> that there just had to be a trade-off between dropping CAP_SYS_ADMIN
>> (and making it more difficult to escape the container), and a user
>> inside the container DOSing the container by filling up /run.
>> But with your idea, I can at least separate /run/user from /run
>> itself 
> Hmm, which container manager are you using?

LXC 1.0.6 (which comes with Debian Jessie). I use the following
configuration for containers w/o CAP_SYS_ADMIN (note: I'm not claiming
this is secure (non-userns-containers may never be), and also this is
still work in progress and I'm only posting this as a proof of concept
and so that other people can reproduce it):


lxc.cgroup.use = @all


# This is still work in progress, I can probably get rid of some of
# those FSs, I'm not really comfortable with e.g. debugfs there.
# But if I remove them, I'll probably have to mask the units unless I
# want error messages on every container startup, and I would really
# like to keep the delta low... Still thinking about that.
lxc.mount.auto = proc sys cgroup:mixed
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = debugfs sys/kernel/debug debugfs rw,relatime 0 0
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir 0 0
lxc.mount.entry = hugetlbfs dev/hugepages hugetlbfs rw,relatime,create=dir 0 0
# here I'll probably add the /run/user entry
lxc.tty = 4
lxc.pts = 1024

lxc.cap.drop = sys_admin sys_module mac_admin mac_override net_admin sys_time syslog

lxc.cgroup.devices.deny = a
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m
lxc.cgroup.devices.allow = c 1:3 rwm   #/dev/null
lxc.cgroup.devices.allow = c 1:5 rwm   #/dev/zero
lxc.cgroup.devices.allow = c 1:7 rwm   #/dev/full
lxc.cgroup.devices.allow = c 5:0 rwm   #/dev/tty
lxc.cgroup.devices.allow = c 1:8 rwm   #/dev/random
lxc.cgroup.devices.allow = c 1:9 rwm   #/dev/urandom
lxc.cgroup.devices.allow = c 5:2 rwm   #/dev/pts/ptmx
lxc.cgroup.devices.allow = c 136:* rwm #/dev/pts/*
lxc.cgroup.devices.allow = c 254:0 rm  #/dev/rtc{,0}
lxc.cgroup.devices.allow = c 10:228 rm #/dev/hpet

# this is just the Debian default, I didn't change anything
# there (so far):
lxc.seccomp = /usr/share/lxc/config/common.seccomp

lxc.autodev = 1
lxc.kmsg = 0

lxc.haltsignal = SIGRTMIN+14


lxc.include = /etc/lxc/jessie-container.conf
lxc.utsname = something
lxc.rootfs  = /path/to/something
lxc.arch    = amd64

# network:
lxc.network.type = veth
# (and other directives that specify IP etc.)
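
A sketch of what the /run/user entry mentioned in the comment above could
look like, following the pattern of the run/lock entry. This is untested:
mode=755 and the size value are assumptions picked for illustration, not
taken from a working setup.

```
# hypothetical entry, placed after the run and run/lock mounts;
# size=16m is an arbitrary placeholder
lxc.mount.entry = tmpfs run/user tmpfs rw,nosuid,nodev,mode=755,size=16m,create=dir 0 0
```

Since the entry comes after the one for run, create=dir should create the
directory inside the already mounted /run tmpfs, the same way it does for
run/lock.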

Also inside the container the following changes w.r.t. vanilla Jessie:

 - explicitly enable getty@tty{1,2,3,4}.service
 - no ConditionPathExists=/dev/tty0 for getty@.service
 - mask systemd-udevd.service (haven't tested if that's actually needed,
   the lxc-debian template also does this however)
 - touch /etc/fstab if you debootstrap it directly
 - I hope I didn't forget anything
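
For reproduction, the list above can be sketched as a small shell function
that is run on the host against the container's rootfs. This is only a
sketch under assumptions: the function name apply_container_tweaks is made
up, the unit paths assume the Debian Jessie layout (/lib/systemd/system),
and the symlinks are created by hand to mimic what "systemctl enable" and
"systemctl mask" would do.

```shell
#!/bin/sh
# Hypothetical helper: apply the in-container changes to a rootfs directory.
# Usage: apply_container_tweaks /path/to/rootfs
apply_container_tweaks() (
    set -eu
    rootfs=$1
    units="$rootfs/etc/systemd/system"

    mkdir -p "$units/getty.target.wants"

    # Override getty@.service with a copy that lacks the
    # ConditionPathExists=/dev/tty0 line.
    sed '/^ConditionPathExists=\/dev\/tty0$/d' \
        "$rootfs/lib/systemd/system/getty@.service" > "$units/getty@.service"

    # Enable getty@tty{1,2,3,4}.service by creating the wants symlinks.
    for n in 1 2 3 4; do
        ln -sf /lib/systemd/system/getty@.service \
            "$units/getty.target.wants/getty@tty$n.service"
    done

    # Mask systemd-udevd.service (a symlink to /dev/null).
    ln -sf /dev/null "$units/systemd-udevd.service"

    # Make sure /etc/fstab exists.
    touch "$rootfs/etc/fstab"
)
```

Called as "apply_container_tweaks /path/to/something" (as root), with the
path being whatever lxc.rootfs points to.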

Didn't try other Distros inside the containers yet (or LXC w/ systemd on
other distros for that matter).

Also, on the host, I DON'T have cgmanager or similar installed.

> I am tempted to just
> change nspawn to mount a private tmpfs into /run/user, too, as it
> already mounts /run anyway.

That would solve /run-quota issues for CAP_SYS_ADMIN-less containers,
but is unnecessary (although harmless) for those that do have it.

>> (the same way mode=1777 /run/lock is a separate tmpfs already)
>> by just a simple static mount entry for the container.
> Hmm, /run/lock is a separate tmpfs? /run/lock is a pretty useless,
> legacy thing. Which distro is this?

Debian Jessie. But a box with Fedora 19 here also has it (although not
mode=1777 but mode=0755) and in both Debian Jessie and Fedora 19 there
is some stuff in there. Although on Fedora it's not a separate tmpfs.

(Note that in Debian you can also configure it to be on the same tmpfs
as /run, but since on Debian it has mode 1777, there's a good reason NOT
to do that.)

