[systemd-devel] Making /run respect Container Memory Limits

Sat Sep 21 15:18:32 UTC 2024

I have a question about a bug and the (primarily philosophical) approach to take in fixing it.

Note that I'm not requesting that the bug be fixed. I'm happy to supply any patches; I am just keen to get some direction on the approach to take.

The bug stems from mount decisions made at the very start of systemd's startup.

When systemd starts, it sets up its initial mount points, one of which is /run. This tmpfs is always mounted with a hard-coded size of 20% of memory.

The problem is that on a big-memory host (let's say, for argument's sake, 256 GiB) that hosts many very small containers with cgroup memory limits set (let's go with 2 GiB), the default configuration of systemd leads to an inevitable lockup. This feels like a design oversight to me.

---

What is basically happening is the following:
- We define a 2 GiB container and start the init process.
- /run is mounted in the container with a size limit of 20% of host memory (about 51 GiB), not 20% of the container's limit.
- systemd-journald starts and logs into `/run`. It is itself configured to use only 15% of that (about 7.7 GiB).
- Time passes and logs accumulate in the container's /run until its usage is close to or exceeds 2 GiB.
- The container becomes memory-constrained, and the OOM killer kills off processes.
- The journal logs these OOM kills too, using up yet more space.
- This cycle continues until there is no memory left and no valid process left to kill, and the system becomes completely stalled and locked up.
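
As a rough sketch of the arithmetic in the steps above: the helper name below is hypothetical, and the 15% journal figure is taken from this post as-is rather than checked against journald's documentation.

```python
GIB = 1024 ** 3

def run_tmpfs_limit(host_memory_bytes, percent=20):
    """Size limit of a tmpfs mounted with size=<percent>%.

    The percentage is relative to host RAM, not to any cgroup limit.
    """
    return host_memory_bytes * percent // 100

host_memory = 256 * GIB      # example host from the post
container_limit = 2 * GIB    # example cgroup memory limit from the post

run_limit = run_tmpfs_limit(host_memory)
# Assumption: journald may use up to 15% of the filesystem, as stated in the post.
journal_cap = run_limit * 15 // 100

print(f"/run size limit: {run_limit / GIB:.1f} GiB")
print(f"journal cap:     {journal_cap / GIB:.1f} GiB")
print(f"container limit: {container_limit / GIB:.1f} GiB")
print(f"journal cap exceeds container limit: {journal_cap > container_limit}")
```

The point is only that both computed caps sit well above the container's 2 GiB limit, so neither of them ever stops the journal from growing before the cgroup limit is hit.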

It is not at all obvious to a normal user how a container can lock up completely when it appears not to be running anything.
Indeed, the ultimate fate of a container like this is an inevitable lockup, since the journal will keep logging and consuming tmpfs space beyond the constraints of the container.

---

After perusing some of the code, there seem to be a few approaches to fixing this.

1. Add some code to `mount_setup` that goes back over `/run` whenever a container type is detected, recalculates a size that respects the container's constraints in some sensible way (16 MiB or 20% of the container's memory limit, whichever is higher), and then issues a remount.
2. Much like 1, but instead add a function pointer field to the `struct MountPoint` entry associated with /run to do effectively the same thing, in a manner that is generalized enough to be used for future cases. (/dev might also benefit, but it's not as bad.)
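
The size calculation in options 1 and 2 could look roughly like the sketch below. It is in Python for brevity rather than systemd's C, all function names are hypothetical, and it assumes the limit has already been read from a cgroup v2 `memory.max` file, which contains either a byte count or the literal string `max`.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def parse_cgroup_limit(raw, host_memory_bytes):
    """Interpret the contents of a cgroup v2 memory.max file.

    'max' means no limit is set, so fall back to host memory.
    """
    raw = raw.strip()
    if raw == "max":
        return host_memory_bytes
    return min(int(raw), host_memory_bytes)

def run_size_for_container(limit_bytes, percent=20, floor_bytes=16 * MIB):
    """20% of the effective limit, but never less than 16 MiB."""
    return max(floor_bytes, limit_bytes * percent // 100)

def remount_options(size_bytes):
    """tmpfs mount options for a remount; size is given in KiB."""
    return f"remount,size={size_bytes // 1024}k"

limit = parse_cgroup_limit("2147483648\n", 256 * GIB)
print(remount_options(run_size_for_container(limit)))
```

With an unlimited cgroup (`max`) this degrades to the current behaviour of 20% of host memory, so only constrained containers would see a change.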

However, the problem with 1 or 2 is that Ubuntu (and probably other Debian derivatives) really doesn't like tmpfs filesystems being remounted in a user namespace later on (I know this is the case in LXC). This is due to AppArmor constraints.
To get around that, one could write some code in `mount_setup` that unmounts `/run` and mounts it again with the correct size, which avoids triggering a security denial provided nothing has been written to it yet (I haven't checked that bit). This also feels pretty ugly.

So the philosophical question is: do you consider fixes that you expect to be ineffective unless a distro provider also makes changes? Or would it be more suitable to fix the early boot process so that distro providers do not need to change their software?

The trickier, 'proper' fix to all this is not to go back over /run but to mount it with the correct size from the start. The problem here is that cgroupfs is mounted later (not during early boot), so at that point it is effectively impossible to calculate container-aware tmpfs sizes.
This would require refactoring the early boot code to mount cgroupfs even earlier. That makes me pretty nervous, as it feels as if some conscious effort went into what gets mounted and when. I'm also quite aware of the cgroupv1 and cgroupv2 complexity that adds to all this.

Finally, another fix (and the one we're using in production) is simply to remount the `/run` path via `/etc/fstab` with manually calculated sizes appropriate to the container. I'm assuming this could also be done with a generator of some kind. Again, this falls foul of the distro-level security issue of AppArmor not allowing remounts. Our production solution is to disable the AppArmor profile for LXC (sigh).
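
For illustration, an `/etc/fstab` entry of the kind described above could be generated as in the sketch below. The helper name is hypothetical, and the option set (`rw,nosuid,nodev,mode=755`) is an assumption for the example, not a copy of what systemd or our production setup uses.

```python
def fstab_line_for_run(size_bytes, mode="755"):
    """Build an /etc/fstab entry that sizes the /run tmpfs explicitly.

    The size is written in KiB using tmpfs's 'k' suffix.
    """
    size_kib = size_bytes // 1024
    options = f"rw,nosuid,nodev,mode={mode},size={size_kib}k"
    return f"tmpfs /run tmpfs {options} 0 0"

# Example: roughly 20% of a 2 GiB container, rounded to 400 MiB by hand.
print(fstab_line_for_run(400 * 1024 * 1024))
```

The size still has to be chosen per container, which is why this is only a workaround.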

In any case, I'm really just looking for some direction before I start on any patches for this problem.

Whilst this can be worked around late in boot, it seems to be a design oversight that it is possible to produce a systemd system that is guaranteed to fail at some point in the future because of bad mount options. It feels to me that a more sensible approach to mount options should be implemented in code at early boot.