[systemd-devel] is the watchdog useful?

Tue Oct 22 11:35:13 UTC 2019

On Tue, Oct 22, 2019 at 10:51:49AM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> On Tue, Oct 22, 2019 at 12:34:45PM +0200, Umut Tezduyar Lindskog wrote:
> > I am curious Zbigniew of how you find out if the coredump was on a starved
> > process?
> 
> A very common case is systemd-journald which gets SIGABRT when in a
> read() or write() or similar syscall. Another case is when
> systemd-udevd workers get ABRT when doing open() on a device.
> 

In the case of journald, is it really in read()/write() syscalls you're
seeing the SIGABRTs?

Back when I worked on making journald offlining asynchronous, the
primary source of problems was very slow fsync() calls.  But we've since
moved those into a thread, so that particular source of aborts should be
gone.

I'd be a little surprised to hear it's now aborting in read()/write()
calls, because journald's storage IO is all done through mmap() windows.

At the time I was working on it, something I latently wanted to explore
was converting journald to a more database-ish Direct-IO + AIO storage
engine.  Partially to eliminate the nondeterministic, potentially
lengthy stalls a thrashing or otherwise unresponsive backing store could
cause when journald faulted on an affected page in its mappings.

As long as journald is mmap()-based, it will always be susceptible to
long, uncontrolled delays blocking the event loop, as it uses a single
process for both its event loop and the structure-modifying operations
on those mappings.

If it used AIO, the IOs would all be submitted asynchronously and
completed as part of the same event loop that services the watchdog
timer.  So it'd be far less likely to starve the event sources,
including the watchdog timer, since the event loop itself would be its
principal blocking point.
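
To make the difference concrete, here is a rough sketch (not journald
code, and in Python rather than C).  It measures the longest gap between
heartbeats of a watchdog-like timer when a slow storage call runs inline
in the event loop versus when it is handed off and awaited.  A thread
pool stands in for real AIO here, and slow_io(), heartbeat() and
max_heartbeat_gap() are hypothetical names chosen for illustration.

```python
import asyncio
import time


def slow_io(seconds):
    """Hypothetical stand-in for a storage call that stalls (e.g. fsync)."""
    time.sleep(seconds)


async def heartbeat(interval, ticks):
    """Record a timestamp every `interval` seconds, like a watchdog ping."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def max_heartbeat_gap(blocking, io_seconds=0.3, interval=0.05):
    """Return the longest gap between heartbeats around one slow IO."""
    ticks = []
    hb = asyncio.create_task(heartbeat(interval, ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick

    if blocking:
        # Runs inline: nothing else in the loop can run until it returns.
        slow_io(io_seconds)
    else:
        # Handed to a worker thread: the loop keeps dispatching timers.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, slow_io, io_seconds)

    await asyncio.sleep(interval)  # give the heartbeat a chance to catch up
    hb.cancel()
    await asyncio.gather(hb, return_exceptions=True)

    return max(b - a for a, b in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("inline   :", asyncio.run(max_heartbeat_gap(blocking=True)))
    print("offloaded:", asyncio.run(max_heartbeat_gap(blocking=False)))
```

In the inline case the gap is at least as long as the stalled call, which
is what a watchdog would see as a hang.  In the offloaded case the gap
stays close to the heartbeat interval.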

I realize these details are a bit orthogonal to the watchdog, but they
might help in appreciating the extent of the non-determinism causing
such false positives.

Programs like journald have not been carefully structured to minimize
the event loop delays that the watchdog perceives as hangs.

Of course, if the journald process itself began faulting in and out
every time it was scheduled on a thrashing system, it could still become
delayed enough for an aggressive watchdog to fire.  It would be a lot
easier to pin its pages in memory and insulate it from those situations
if its working set were a relatively small, finite set used for
dispatching a bounded number of concurrent async IOs: just mlock most of
what it ever needs early on.  But we're nowhere near that architecture
today, and that's really the kind of approach needed to make things
watchdog-appropriate with a minimum of false positives.
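
As a minimal sketch of the "mlock early" idea (again not journald code,
and using Python with ctypes instead of C): allocate a fixed-size
buffer up front and ask the kernel to keep it resident with mlock(2).
The helper name lock_buffer() is hypothetical, the libc lookup via
ctypes.CDLL(None) assumes a Linux-like system, and the call can
legitimately fail when RLIMIT_MEMLOCK is too low, so the result is
reported rather than assumed.

```python
import ctypes
import mmap


def lock_buffer(size=mmap.PAGESIZE):
    """Allocate an anonymous mapping and try to pin it with mlock(2).

    Returns (buffer, locked), where `locked` says whether mlock succeeded.
    """
    buf = mmap.mmap(-1, size)  # anonymous, page-aligned mapping
    libc = ctypes.CDLL(None, use_errno=True)

    # Borrow the mapping's address through a temporary ctypes view.
    view = (ctypes.c_char * size).from_buffer(buf)
    rc = libc.mlock(ctypes.c_void_p(ctypes.addressof(view)),
                    ctypes.c_size_t(size))
    del view  # release the buffer export so `buf` can be closed later

    return buf, rc == 0


if __name__ == "__main__":
    buf, locked = lock_buffer()
    print("pinned" if locked else "mlock failed (check RLIMIT_MEMLOCK)")
```

A real daemon would more likely call mlock() or mlockall() directly from
C during startup, before it begins servicing events.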

Regards,
Vito Caputo