[systemd-devel] Making /run respect Container Memory Limits

Tomasz Pala gotar at polanet.pl
Sat Sep 28 18:53:49 UTC 2024


On Mon, Sep 23, 2024 at 12:30:17 +0200, Lennart Poettering wrote:

> /run/ is only mounted by systemd if it is not pre-mounted already by
> the container manager. We generally assume the container manager does
> that (for example systemd-nspawn does it that way), already because
> /run/host/ is the mechanism to pass outside info/resources into the
> container in a systemd world, hence it really needs to be premounted.

Just for the record, as I've been investigating a similar issue.

systemd-nspawn does premount several tmpfs instances, but exhibits
behaviour similar to the one reported by the OP. According to the values specified in
https://github.com/systemd/systemd/blob/main/src/basic/mountpoint-util.h
containers end up with:
/dev/shm, /tmp and others
	using NESTED_TMPFS_LIMITS: size=10% of the HOST RAM
/run	using TMPFS_LIMITS_RUN: size=20%

As there's no user quota applied, and (at least for PrivateUsers=
containers) systemd-remount-fs cannot remount these mountpoints, all
such containers are vulnerable to unprivileged user DoS (OOM).

Only /dev is protected against root mistakes (like cat /dev/zero > /dev/nul).

It would be nice to have these percentages resolved against the
container's restricted memory (e.g. by recalculating the sizes from the
MemoryMax= value), but as a band-aid solution I've come up with the
following service template, wanted by and ordered after the nspawn unit:

[Unit]
Description=Remount sanely tmpfs fses inside systemd-nspawn@%i
After=systemd-nspawn@%i.service

[Service]
Type=oneshot

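# "-" ignores a failing exit status of the command; ":" disables systemd's
# own variable substitution, so $( ... ) is passed to the shell verbatim.
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=1G,noexec /dev/shm 2>/dev/null'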
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=2G /tmp 2>/dev/null'
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=2G /run 2>/dev/null'

SyslogIdentifier=nspawn-remount-tmpfses@%i
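
The sizes in this unit are hard-coded. The manual recalculation against
MemoryMax= mentioned earlier could be sketched roughly as below. This is
only an illustration, not something the unit above does: the function name
and the container name "mycontainer" are made up, and it assumes that
systemctl show prints MemoryMax= either as a byte count or as "infinity".

```shell
#!/bin/sh
# Print a tmpfs size= value (in KiB, with a "k" suffix) equal to PERCENT of
# LIMIT_BYTES. Returns non-zero when the limit is not a plain byte count,
# e.g. "infinity" for a container without a memory limit.
tmpfs_size_for_limit() {
    limit_bytes=$1
    percent=$2
    case $limit_bytes in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Convert to KiB first, then take the percentage (integer arithmetic).
    echo "$(( limit_bytes / 1024 * percent / 100 ))k"
}

# Hypothetical usage on the host, for a container named "mycontainer":
#   limit=$(systemctl show systemd-nspawn@mycontainer.service -p MemoryMax --value)
#   size=$(tmpfs_size_for_limit "$limit" 20) && echo "size=$size"
```

The resulting value could then replace the fixed size=... in the remount
commands; when no limit is set the function fails and the defaults stay.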


The remount commands in the unit above return EPERM from mount_setattr(); fortunately
	fsconfig(4, FSCONFIG_SET_STRING, "size", "1G", 0)
is called before that and apparently works.
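
Since mount reports a failure even though the new size seems to be applied,
it may be worth checking the effective options afterwards. A minimal sketch,
assuming findmnt is available inside the container; the helper name and
"mycontainer" are hypothetical:

```shell
#!/bin/sh
# Extract the value of the size= option from a comma-separated mount option
# string such as the one printed by "findmnt -no OPTIONS".
tmpfs_size_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^size=//p'
}

# Hypothetical usage on the host, for a container named "mycontainer":
#   leader=$(machinectl show mycontainer -p Leader --value)
#   opts=$(nsenter -t "$leader" -m findmnt -no OPTIONS /dev/shm)
#   tmpfs_size_opt "$opts"
```

An empty result means the mount carries no explicit size= option.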


I use this method (nsenter) to apply configuration that has no
appropriate option in nspawn itself and is forbidden inside the container
(when unprivileged, despite being namespaced), e.g.:

nsenter -t [...] -U -F sysctl -w user.max_user_namespaces=0

to reduce kernel attack surface from within not-so-trusted containers.

-- 
Tomasz Pala <gotar at pld-linux.org>

