[systemd-devel] Making /run respect Container Memory Limits

Mon Sep 23 10:30:17 UTC 2024

On Sat, 21.09.24 16:18, Matthew Ife (matthew at ife.onl) wrote:

> I have a question regarding a bug and (primarily philosophical)
> approach to take to fix the bug.
>
> Note, I'm not requesting the bug be fixed -- I'm happy to supply any
> patches, I am just keen to get some direction on the approach to
> take.
>
> So, the bug appears due to very early mount decisions made at the
> very start of systemd's startup.
>
> When it starts it makes its initial mount points, one of which is
> /run. This mount is statically sized to always be 20% of RAM.
>
> The problem here is that for big memory hosts (let's say, for
> argument's sake, 256 GiB) that host many very small containers with
> cgroup memory limits set (let's go with 2 GiB), the default
> configuration of systemd leads to an inevitable lockup. This feels
> like a design oversight to me.

Why an inevitable lockup? Not following? The limit is just a limit,
nothing more, i.e. only the space actually used matters, and
overcommitting is a thing, and the memory is fully pageable (unless you
turn off swap, but that's just dumb, and pretty much your own fault?).

> What's basically happening is the following:
> - we define a 2 GiB container and start the init process.
> - 20% of host space is allocated to the container (52 GiB)

That's not correct. The *limit* is set to 20% of the system RAM. This
*allocates* no memory at all, that only happens when the space is
actually used.
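
To make the limit-versus-usage distinction concrete, here is a minimal
Python sketch. It assumes a tmpfs whose size limit is 20% of a 256 GiB
host (about 51.2 GiB, which the quoted message rounds to 52 GiB) and in
which only 4 MiB are actually in use. The FakeStat tuple and the helper
names are hypothetical; only os.statvfs() and its f_frsize, f_blocks and
f_bfree fields are real.

```python
import os
from collections import namedtuple

GiB = 1024 ** 3


def tmpfs_limit_and_usage(st):
    """Return (limit_bytes, used_bytes) from an os.statvfs()-style result."""
    limit = st.f_blocks * st.f_frsize                # the configured ceiling
    used = (st.f_blocks - st.f_bfree) * st.f_frsize  # memory actually consumed
    return limit, used


def run_limit_and_usage(path="/run"):
    """Same numbers for a real mount point (not called in this sketch)."""
    return tmpfs_limit_and_usage(os.statvfs(path))


# Hypothetical numbers: 20% of a 256 GiB host, 4 KiB blocks, 1024 blocks used.
FakeStat = namedtuple("FakeStat", "f_frsize f_blocks f_bfree")
host_ram = 256 * GiB
limit_blocks = (host_ram * 20 // 100) // 4096
fake = FakeStat(f_frsize=4096, f_blocks=limit_blocks, f_bfree=limit_blocks - 1024)

limit, used = tmpfs_limit_and_usage(fake)
print(f"limit: {limit / GiB:.1f} GiB, used: {used / 1024 ** 2:.0f} MiB")
```

Only the "used" figure counts against the container's memory budget;
the limit by itself costs nothing.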

> - systemd-journald starts and logs into `/run`. It itself is
>   configured to only use 15% of the 52 GiB (7.8 GiB).

systemd-journald logs to /run/ only for a short time during boot. In
containers that time is particularly short, and it then transitions to
/var/.

> - time passes and logs accumulate in the container's /run until the
>   /run path is close to or exceeding 2 GiB.

Hmm, it sounds as if you turned off persistent logging? i.e. you told
journald to fill /run/? You kinda are asking for this?
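
As a rough illustration of where the journal ends up, the following
Python sketch models the Storage= setting of journald.conf in a
deliberately simplified way: with "auto" the journal is persistent only
if /var/log/journal/ exists, otherwise it stays in /run/log/journal/.
The function name and the root parameter are hypothetical, and the
model ignores details such as the early-boot phase before /var/ is
writable.

```python
import os
import tempfile


def journal_dir(storage="auto", root="/"):
    """Simplified model of which directory journald writes to."""
    var_dir = os.path.join(root, "var/log/journal")  # persistent, on disk
    run_dir = os.path.join(root, "run/log/journal")  # volatile, on tmpfs
    if storage == "volatile":
        return run_dir
    if storage == "persistent":
        return var_dir
    if storage == "auto":
        # Persistent only if the directory already exists.
        return var_dir if os.path.isdir(var_dir) else run_dir
    if storage == "none":
        return None
    raise ValueError(f"unknown Storage= value: {storage!r}")


# Try the model against a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    before = journal_dir("auto", root)
    os.makedirs(os.path.join(root, "var/log/journal"))
    after = journal_dir("auto", root)
    print(os.path.relpath(before, root), "->", os.path.relpath(after, root))
```

If a container reports the /run/ location long after boot, either
Storage=volatile was set or /var/log/journal/ is missing.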

> - container becomes constrained, OOM killer kills off processes.
> - journal logs these OOM kills too, further using up space.
> - this cycle continues until inevitably there is no longer any
>   memory left or any valid processes to kill, and the system becomes
>   completely stalled and locked up.
>
> This lockup is not obvious to normal users: it is hard to understand
> how a container gets completely locked up when it appears not to be
> running anything.
> Indeed, the ultimate fate of a container like this is an inevitable
> lockup since logging is going to log and consume tmpfs space beyond
> the constraints of the container.

So, why do your containers use /run/ for logging? And did you disable
swap on your system, and thus amplify memory pressure?

> After perusing some of the code there seem to be a few approaches
> to fixing this.
>
> 1. Add some code to `mount_setup` that goes back over `/run` and
> recalculates a value that respects the container's constraints in
> some sensible way (16MiB or 20% of the container's usage, whatever is
> higher) whenever a container type is detected. Then issuing a
> remount.
> 2. Much like 1, but instead adding a function pointer
> field to `struct MountPoint` that's associated with /run to
> effectively do the same thing, but in a manner that's generalized
> enough it could be used for future stuff. (/dev might also benefit,
> but it's not as bad.)
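
The calculation proposed in option 1 could look roughly like the Python
sketch below. It assumes that "20% of the container's usage" means 20%
of the container's cgroup v2 memory limit, read from a memory.max file
that contains either a byte count or the literal string "max". The
function name and its parameters are hypothetical; this is not code
from systemd.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def proposed_run_size(memory_max_text, host_ram_bytes,
                      floor=16 * MiB, percent=20):
    """Size limit for /run: percent of the effective memory limit,
    but never less than floor (both values taken from the proposal)."""
    text = memory_max_text.strip()
    if text == "max":
        # No cgroup limit set: fall back to host RAM.
        limit = host_ram_bytes
    else:
        limit = min(int(text), host_ram_bytes)
    return max(floor, limit * percent // 100)


# A 2 GiB container on a 256 GiB host, as in the example above.
size = proposed_run_size(str(2 * GiB), 256 * GiB)
print(f"{size / MiB:.1f} MiB")
```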

/run/ is only mounted by systemd if it is not already pre-mounted by
the container manager. We generally assume the container manager does
that (for example systemd-nspawn does it that way), not least because
/run/host/ is the mechanism to pass outside info/resources into the
container in a systemd world, hence it really needs to be pre-mounted.
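
Whether /run/ is already a mount point, and with which size option, can
be checked from inside the container by reading /proc/self/mountinfo. A
minimal Python sketch follows; the sample text is made up for
illustration and the function name is hypothetical, only the mountinfo
field layout is real.

```python
def find_mount(mountinfo_text, path="/run"):
    """Return (fstype, super_options) of the last mount at path, or None."""
    found = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or fields[4] != path:
            continue
        try:
            # Optional fields end at a lone "-" separator (index 6 or later).
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if sep + 3 >= len(fields):
            continue
        # Later entries shadow earlier ones, so keep the last match.
        found = (fields[sep + 1], fields[sep + 3])
    return found


# Made-up sample in /proc/self/mountinfo format.
SAMPLE = """\
22 27 0:21 / /proc rw,nosuid shared:5 - proc proc rw
25 27 0:24 / /run rw,nosuid,nodev shared:11 - tmpfs tmpfs rw,size=419432k,mode=755
"""

print(find_mount(SAMPLE, "/run"))
```

On a live system the text would come from
open("/proc/self/mountinfo").read().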

> However, the problem with 1 or 2 is that Ubuntu (probably other
> Debians) really don't like you remounting tmpfs filesystems in a user
> namespace later on down the line (in LXC, I'm aware of this). This
> is due to AppArmor constraints. To get around that problem one

This really sounds like a bug in the LXC/AppArmor policy. I mean, either
you should give the container control over /run/ or not (in which
case the container manager should set it up). But the combination of
"hey, payload, you don't get to change it", and "no, I won't do anything
for my payload", is really broken.

Frankly: there seem to be multiple misconfigs in place:

1. the AppArmor policy seems bogus: it allows mounting but not
   remounting of /run/
2. if you want to manage the resources of your container, let the
   container manager do it, not the container; systemd will be happy.
3. you turned off swap?
4. you turned off persistent logging?

Lennart

--
Lennart Poettering, Berlin