[systemd-devel] systemd-journald may crash during memory pressure
vcaputo at pengaru.com
Sat Feb 10 17:39:35 UTC 2018
On Sat, Feb 10, 2018 at 03:05:16PM +0100, Kai Krakow wrote:
> On Sat, 10 Feb 2018 14:23:34 +0100, Kai Krakow wrote:
>
> > On Sat, 10 Feb 2018 02:16:44 +0200, Uoti Urpala wrote:
> >
> >> On Fri, 2018-02-09 at 12:41 +0100, Lennart Poettering wrote:
> >>> These last log lines indicate journald wasn't scheduled for a long
> >>> time, which caused the watchdog to hit and journald to be aborted.
> >>> Consider increasing the watchdog timeout if your system is indeed that
> >>> loaded and that's supposed to be an OK thing...
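
A minimal sketch of what such an override could look like, assuming a
drop-in is acceptable; the 5min value is only an illustration, not a
recommendation:

    # create a drop-in for journald (opens an editor):
    systemctl edit systemd-journald.service

    # contents of the drop-in:
    [Service]
    WatchdogSec=5min

    # apply it:
    systemctl restart systemd-journald.service

Setting WatchdogSec=0 in the drop-in would disable the watchdog entirely.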
> >>
> >> BTW I've seen the same behavior on a system with a single active
> >> process that uses enough memory to trigger significant swap use. I
> >> wonder if there has been a regression in the kernel causing misbehavior
> >> when swapping? The problems aren't specific to journald - the desktop
> >> environment can freeze completely, too.
> >
> > This problem has been there since kernel 4.9, which was a real pita in
> > this regard. It has been progressively getting better since kernel 4.10.
> > Since then, the kernel seems to try to prevent swapping at any cost, at
> > the price of much higher latency and of pushing all cache out of RAM.
> >
> > The result is processes stuck for easily 30 seconds or more during
> > memory pressure. Sometimes I see the kernel loudly complaining in dmesg
> > about high wait times for allocating RAM, especially from the btrfs
> > module. So the biggest problem may be that kernel threads themselves get
> > stuck in memory allocations and become victims of the high latency.
> >
> > Currently I'm running my user session in a slice capped at 80% of RAM,
> > which seems to help: it avoids discarding all of the cache. I also put
> > some potentially heavy memory users (in terms of cache and/or resident
> > memory) into slices with carefully chosen memory limits (backup and
> > maintenance services). Slices limited this way start swapping before the
> > cache is discarded, and everything works better again. Part of the
> > problem may be that I have one process running which mmaps and locks 1G
> > of memory (bees, a btrfs deduplicator).
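
Roughly how such slices could look; the paths, names, and percentages
below are illustrative only, and MemoryHigh=/MemoryMax= assume the unified
cgroup hierarchy - on cgroup v1 the corresponding knob is MemoryLimit=:

    # e.g. /etc/systemd/system/user.slice.d/50-memory.conf
    # caps the whole user session; percentages are relative to physical RAM
    [Slice]
    MemoryHigh=70%
    MemoryMax=80%

    # and a drop-in for a hypothetical backup.service, moving it into a
    # separately limited slice defined the same way:
    [Service]
    Slice=limited.slice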
> >
> > This system has 16G of RAM, which is usually plenty, but I use tmpfs to
> > build packages in Gentoo, and while that worked wonderfully before 4.9,
> > I have to be really careful now. The kernel happily throws away cache
> > instead of swapping early. Setting vm.swappiness to different values
> > seems to have no perceptible effect.
> >
> > Software that uses mmap is the first latency victim of this new
> > behavior. As such, systemd-journald also seems to be hit hard by it.
> >
> > After the system has recovered from high memory pressure (which can take
> > 10-15 minutes, resulting in a loadavg of 400+), it ends up with some
> > gigabytes of inactive memory in swap, which it will only swap back in
> > during shutdown (which then also takes several minutes).
> >
> > The problem since 4.9 seems to be that the kernel tends to do swap
> > storms instead of constantly swapping out memory at low rates during
> > usage. The swap storms totally thrash the system.
> >
> > Before 4.9, the kernel had no such latency spikes under memory pressure.
> > Swap would usually grow slowly over time, and the system felt sluggish
> > now and then but was still usable latency-wise. I usually ended up
> > with 5-8G of swap usage, and that was no problem. Now, swap only grows
> > significantly during swap storms that leave the system unusable for many
> > minutes, with latencies of 10+ seconds around twice per minute.
> >
> > I haven't had a swap storm since the last boot, and swap usage is around
> > 16M now. Before kernel 4.9, it would already be much higher.
>
> After some more research, I found that vm.watermark_scale_factor may be
> the knob I am looking for. I'm going to watch the behavior now with a
> higher factor (default = 10, now 200).
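
For anyone wanting to try the same, a sketch of how that knob can be set,
assuming a sysctl.d drop-in is acceptable (if I read the kernel docs right,
the unit is fractions of 10000, so 10 means the distance between the
memory watermarks is about 0.1% of RAM, and 200 about 2%):

    # at runtime:
    sysctl vm.watermark_scale_factor            # show the current value
    sysctl -w vm.watermark_scale_factor=200

    # persistently, e.g. in /etc/sysctl.d/80-vm.conf:
    vm.watermark_scale_factor = 200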
>
Have you reported this to the kernel maintainers? LKML?
While this is interesting to read on systemd-devel, it's not the right
venue. What you describe sounds like a regression that should probably
be fixed upstream.
Also, out of curiosity, are you running dmcrypt in this scenario? If
so, is swap on dmcrypt as well?
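In case it helps answer that, one way to check, assuming util-linux is
available:

    swapon --show           # list active swap devices
    lsblk -o NAME,TYPE      # TYPE reads "crypt" for dm-crypt mappings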
Regards,
Vito Caputo