[systemd-devel] how to let systemd hibernate start/stop the swap area?

Fri Mar 31 19:16:43 UTC 2023

On Fri, 31 Mar 2023, Lennart Poettering wrote:
[...]
> Presumably your system mmaps ELF binaries, VM images, and similar
> stuff into memory. If you don't allow anonymous memory to be backed out
> onto swap, then you're basically telling the kernel "please page out
> my program code instead". Which is typically a lot worse.

Yes, but my point is that it _doesn't matter_ if SSH or journald or 
whatever is in memory or needs to be paged back in again. It's such a tiny 
fraction of the system's overall workload.

This is why Luca's suggestion of using memory.swap.max=0 on all the QEMU 
processes isn't measurably better than just not using swap at all. Either 
99% of the system isn't using swap, or 100% of it isn't using swap.
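
A rough sketch of that arithmetic, in Python. The guest sizes are
made-up numbers chosen to match the "99%" figure, and the function name
is hypothetical; the only assumption is that every guest runs with
memory.swap.max=0 while the host services stay swappable.

```python
def swap_eligible_fraction(guest_gib, host_services_gib):
    """Fraction of in-use memory that could still go to swap when
    every guest is capped with memory.swap.max=0."""
    total_gib = sum(guest_gib) + host_services_gib
    return host_services_gib / total_gib


# Hypothetical host: three 33 GiB guests plus about 1 GiB of host services.
guests_gib = [33, 33, 33]
fraction = swap_eligible_fraction(guests_gib, host_services_gib=1)
print(f"swap-eligible share of memory: {fraction:.1%}")
```

With numbers like these, enabling swap only changes the outcome for
about one hundredth of the memory in use.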

> That's why I am saying that yeah, if you want zero IO then that's OK,
> but in that case you want *neither* anonymous memory being backed by
> disk swap *nor* file-backed memory backed by disk file systems. But
> you made the strange choice of saying "IO by file-backed memory is
> good", but "IO by anonymous memory" is bad, and then allow the former
> and forbid the latter.
> 
> hence my question: do you run your OS from an in-memory file system of
> some kind? because if not you just shift around what gets paged out,
> and because you make the pool of reclaimable memory smaller you
> increase IO.

In practice, everything that needed to run on the host was either already 
in memory or could be paged in quickly. Given the sum total of that was 
only a GB or so, that's not surprising.

[...]
> Well, in larger environments the goal is typically to saturate all
> hosts, but not overload them. i.e. maximizing your ROI. No need to
> fall from one extreme into the other. Today's Linux can actually
> achieve something like this, if you use it properly. Swap is part of
> using it "properly".
> 
> Oversized hw is typically a bad investment. In particular in today's
> cloud world where costs multiply with every node you have.

If customers have paid for RAM, you don't turn around and give them swap 
instead. That's just plain dishonest.

So yes, the system _does_ need to have more physical memory than the sum 
of the guests' virtual memory. Then you add on a bit more so you've got 
some room for buffers and page cache, since (at least in my case) IO was 
local. That's the size of the server you need.
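
That sizing rule can be sketched in a few lines of Python. The function
name, the guest sizes and the 8 GiB of cache headroom are illustrative
assumptions; the 1 GiB default for host services reflects the "a GB or
so" mentioned above.

```python
def required_host_ram_gib(guest_ram_gib, host_services_gib=1.0,
                          cache_headroom_gib=8.0):
    """Physical RAM needed so that no guest memory is ever backed by swap.

    guest_ram_gib:      RAM sold to each guest, in GiB
    host_services_gib:  host daemons (SSH, journald, ...), roughly 1 GiB
    cache_headroom_gib: room for buffers and page cache (assumed value)
    """
    return sum(guest_ram_gib) + host_services_gib + cache_headroom_gib


# Hypothetical guest mix: 16 + 16 + 32 + 64 = 128 GiB sold to customers.
guests_gib = [16, 16, 32, 64]
print(f"host needs at least {required_host_ram_gib(guests_gib)} GiB of RAM")
```

How much headroom to add depends on the IO pattern; with local storage,
more page cache is worth having.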