[systemd-devel] Making /run respect Container Memory Limits
Tomasz Pala
gotar at polanet.pl
Sat Sep 28 18:53:49 UTC 2024
On Mon, Sep 23, 2024 at 12:30:17 +0200, Lennart Poettering wrote:
> /run/ is only mounted by systemd if it is not pre-mounted already by
> the container manager. We generally assume the container manager does
> that (for example systemd-nspawn does it that way), already because
> /run/host/ is the mechanism to pass outside info/resources into the
> container in a systemd world, hence it really needs to be premounted.
Just for the record, as I've been investigating a similar issue:
systemd-nspawn does premount several tmpfses, but exhibits behaviour
similar to what the OP reported. According to the values specified in
https://github.com/systemd/systemd/blob/main/src/basic/mountpoint-util.h
containers end up with:
/dev/shm, /tmp and others
using NESTED_TMPFS_LIMITS: size=10% of the HOST RAM
/run using TMPFS_LIMITS_RUN: size=20% of the HOST RAM
As there's no user quota applied, and (at least for PrivateUsers=
containers) systemd-remount-fs cannot remount these mountpoints, all
such containers are vulnerable to unprivileged user DoS (OOM).
Only /dev is protected against root mistakes (like cat /dev/zero > /dev/nul).
It would be nice to have these percentage values resolved against the
container's restricted memory (like manually recalculating sizes from the
MemoryMax= value), but as a band-aid solution I've come up with the
following service template, which is wanted by the nspawn unit and ordered after it:
[Unit]
Description=Remount sanely tmpfs fses inside systemd-nspawn@%i
After=systemd-nspawn@%i.service
[Service]
Type=oneshot
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=1G,noexec /dev/shm 2>/dev/null'
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=2G /tmp 2>/dev/null'
ExecStart=-:/bin/sh -c 'nsenter -t $( machinectl show %i -p Leader --value ) -m mount -o remount,size=2G /run 2>/dev/null'
SyslogIdentifier=nspawn-remount-tmpfses@%i
The above commands return EPERM from mount_setattr(); fortunately
fsconfig(4, FSCONFIG_SET_STRING, "size", "1G", 0)
is called before that and apparently works.
I use this method (nsenter) to alter nspawn configuration that has no
appropriate option in nspawn itself and is forbidden inside the container
(when unprivileged, despite being namespaced), e.g.:
nsenter -t [...] -U -F sysctl -w user.max_user_namespaces=0
to reduce kernel attack surface from within not-so-trusted containers.
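As for recalculating the sizes against MemoryMax= rather than hardcoding 1G/2G, a minimal sketch of the arithmetic could look like the following. The helper name tmpfs_size_from_limit is made up for illustration and is not part of systemd; it only scales a byte count by a percentage, and assumes the limit has already been obtained as a plain number of bytes.

```shell
#!/bin/sh
# Hypothetical helper (not part of systemd): print PERCENT of LIMIT_BYTES
# as a whole number of bytes, usable as "mount -o remount,size=<bytes>".
tmpfs_size_from_limit() {
    limit_bytes=$1
    percent=$2
    # An unset MemoryMax= reads back as "infinity"; reject anything
    # that is not a plain non-negative integer.
    case $limit_bytes in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo $(( limit_bytes * percent / 100 ))
}

# Example: a container capped at MemoryMax=4G (4294967296 bytes)
tmpfs_size_from_limit 4294967296 10   # share matching NESTED_TMPFS_LIMITS
tmpfs_size_from_limit 4294967296 20   # share matching TMPFS_LIMITS_RUN
```

On the host, the limit itself could presumably be read with something like "systemctl show systemd-nspawn@NAME.service -p MemoryMax --value", assuming MemoryMax= is set on that unit; the result would then replace the fixed size= values in the remount commands.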
--
Tomasz Pala <gotar at pld-linux.org>