[systemd-devel] System stability when journald locks up

Lennart Poettering lennart at poettering.net
Wed May 30 12:59:25 PDT 2012


On Wed, 30.05.12 22:36, Marti Raudsepp (marti at juffo.org) wrote:

Heya,

> On Tue, May 29, 2012 at 11:29 PM, Lennart Poettering
> <lennart at poettering.net> wrote:
> > The journal is still very new. I think so far it is quite
> > stable, but there is definitely more work necessary to make it rock
> > solid in all corner cases.
> 
> Well, any concrete ideas? Locking the user out of his/her own system
> is the best way to become hated by sysadmins. I certainly don't want
> to see journald on my servers until that's addressed.
> 
> Maybe a client-side timeout in libsystemd-journal? While that could
> still effectively crash applications by slowing them down too much, at
> least it's possible to log in to inspect and fix the issue.

Well, this is hard to fix and inherent to the syslog client side,
i.e. glibc's syslog() call. It's synchronous and hence SIGSTOPing your
syslog daemon of choice will eventually cause the whole system to
freeze.

Many people have configured their classic syslog daemon to output logs
on /dev/tty12. If you press C-s there (or accidentally hit Scroll Lock)
you end up freezing syslog too and thus freezing the entire machine
sooner or later. It's kind of a known problem.

Not sure what we could do about this. Sure, it is easy to fix the
journal client libraries and have a timeout in there, but only a small
number of clients use the native libraries; most go via glibc's
syslog() call, which is implemented synchronously. Fixing glibc would
definitely be a good idea though.

Adding the timeout change there (which would actually be dead-easy,
simply by using SO_SNDTIMEO) would not really fix the problem too well
though: given the number of messages that are generated, and with each
of them blocking until the timeout expires, the system might not be
locked up entirely but would still be very slow.
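
For illustration, here is a minimal sketch of what such a client-side
timeout could look like. It is not glibc's code. It assumes Linux and a
datagram AF_UNIX socket like the one behind /dev/log, and the helper
names (set_send_timeout, log_send_bounded, demo_stalled_receiver) are
made up for this example. A socketpair whose receiving end is never
read stands in for a stopped log daemon.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: bound how long send() may block on fd. */
static int set_send_timeout(int fd, long milliseconds) {
    struct timeval tv;

    tv.tv_sec = milliseconds / 1000;
    tv.tv_usec = (milliseconds % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: send one log record, dropping it on timeout.
 * Returns 1 if sent, 0 if dropped after the timeout, -1 on other errors. */
static int log_send_bounded(int fd, const char *msg) {
    ssize_t n;

    do {
        n = send(fd, msg, strlen(msg), MSG_NOSIGNAL);
    } while (n < 0 && errno == EINTR);

    if (n >= 0)
        return 1;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;   /* receiver did not drain its queue in time */
    return -1;
}

/* Simulate a stalled daemon: nobody ever reads from sv[1].
 * Returns the number of messages queued before the first drop,
 * or -1 on an unexpected error. Assumes Linux blocking semantics. */
static int demo_stalled_receiver(long timeout_ms) {
    int sv[2], sent = 0, r;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    if (set_send_timeout(sv[0], timeout_ms) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Keep sending until the queue is full and a send times out. */
    while ((r = log_send_bounded(sv[0], "<14>demo: hello")) == 1)
        sent++;

    close(sv[0]);
    close(sv[1]);
    return r == 0 ? sent : -1;
}
```

Once the queue is full, every further message costs the caller the
whole timeout before it is dropped, which is exactly why a chatty
system would stay very slow rather than recover.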

I think avoiding problems like this is mostly a matter of giving it a
bit of robustness love, and testing. Thankfully we now have Harald's
test system logic, and it is definitely my intention to get this
extended so that we can automatically test systemd under all kinds of
memory and disk space pressure.

In short: this is probably best fixed by rigorous, automated testing in
low-memory/low-disk space situations, not by adding timeouts which would
ameliorate the situation only minimally.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

