[systemd-devel] Making /run respect Container Memory Limits

Demi Marie Obenour demi at invisiblethingslab.com
Mon Sep 23 18:22:57 UTC 2024


On Mon, Sep 23, 2024 at 04:14:58PM +0200, Lennart Poettering wrote:
> On Mo, 23.09.24 11:58, Matthew Ife (matthew at ife.onl) wrote:
> 
> > > /run/ is only mounted by systemd if it is not pre-mounted already by
> > > the container manager. We generally assume the container manager does
> > > that (for example systemd-nspawn does it that way), already because
> > > /run/host/ is the mechanism to pass outside info/resources into the
> > > container in a systemd world, hence it really needs to be premounted.
> >
> > I think the above is enough to know the right answer.
> > Fix the container manager to behave correctly. This feels like the most elegant approach.
> >
> > I didn't spot this when trying to understand the best approach to change things. Apologies.
> >
> > Note, you're right about how we do stupid things like disabling swap. It's not my call sadly!
> > Whilst I don't think the answer here is "adding swap will fix" there are a myriad other reasons to
> > have swap and it would at least elongate the cliff-edge we have with this problem otherwise.
> 
> Adding swap *will* fix the issue for you btw to a large degree.
> 
> By not having swap you make it impossible for tmpfs and anonymous
> memory to be paged out. You basically *create* an artificial OOM
> situation if any load shows up, because you artificially minimize the
> amount of reclaimable pages: in most cases only mapped ELF binaries
> become reclaimable this way, so they will be constantly thrashed and
> everything goes to shit.
> 
> If you disable swap on a big server you are just misunderstanding how
> memory management works on Linux, and it's pretty much your own
> fault. This might sound harsh, but it is how it is.

Does this mean that if something can't afford its working set to be
paged out for latency reasons, it _also_ can't afford its own code to be
paged out, and therefore should call mlockall() or otherwise explicitly
mlock() the code and data it is operating on, rather than expecting that
swap be disabled?
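
To make the question concrete, here is a minimal sketch of the mlockall() approach in C. The helper name lock_process_memory() is hypothetical and only for illustration. MCL_CURRENT pins every page mapped right now (code, data, stack), and MCL_FUTURE extends that to later mappings. The call typically fails unless the process has CAP_IPC_LOCK or a sufficiently large RLIMIT_MEMLOCK, so the error path matters.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: pin all current and future mappings of the calling
 * process into RAM so that neither its code nor its data can be paged out.
 * Returns 0 on success, or the (positive) errno value on failure.
 */
int lock_process_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        int saved = errno;  /* save before fprintf can overwrite it */
        fprintf(stderr, "mlockall failed: %s\n", strerror(saved));
        return saved;
    }
    return 0;
}
```

A latency-sensitive service would call this once, early in startup, and decide whether a failure is fatal or merely worth logging.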

> Talk to whoever maintains these systems, and get them to talk to some MM
> person and get educated about these things. There's a fundamental
> misunderstanding here of how loaded systems need to be managed.
> 
> And if you then combine this with non-persistent journald, you are
> artificially amplifying the problem you artificially created for
> yourself, because you intentionally moved even more stuff that would
> normally be backed by disk into unreclaimable memory.
> 
> Lennart

Most (but not all) of the security concerns about swap can be mitigated
by using a dm-crypt volume with an ephemeral key.  Once the system
memory is wiped, the key is gone and with it any chance of accessing the
swapped-out data.
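
As a rough sketch of such a setup, assuming a hypothetical swap partition /dev/sdXN and a hypothetical mapping name "cryptswap", the /etc/crypttab and /etc/fstab entries could look something like the following. Using /dev/urandom as the key file gives a fresh random key on every boot, and the "swap" option reformats the mapped device as swap each time it is set up. The cipher and key size shown are an assumption, not a recommendation.

```
# /etc/crypttab  (device path and mapping name are placeholders)
cryptswap  /dev/sdXN  /dev/urandom  swap,cipher=aes-xts-plain64,size=512

# /etc/fstab
/dev/mapper/cryptswap  none  swap  sw  0  0
```

Because the key is never stored anywhere, hibernation to this swap device is not possible.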
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab