[systemd-devel] is the watchdog useful?

Mon Oct 21 21:32:08 UTC 2019

On Mon, Oct 21, 2019 at 05:50:44PM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> In principle, the watchdog for services is nice. But in practice it seems
> to bring only grief. The Fedora bugtracker is full of automated reports of ABRTs,
> and of those that were fired by the watchdog, pretty much 100% are bogus, in 
> the sense that the machine was resource starved and the watchdog fired.
> 
> There are a few downsides to the watchdog killing the service:
> 1. if it is something like logind, it is possible that it will cause user-visible
> failure of other services
> 2. restarting of the service causes additional load on the machine
> 3. coredump handling causes additional load on the machine, quite significant
> 4. those failures are reported in bugtrackers and waste everyone's time.
> 
> I had the following ideas:
> 1. disable coredumps for watchdog abrts: systemd could set some flag
> on the unit or otherwise notify systemd-coredump about this, and it could just
> log the occurrence but not dump the core file.
> 2. generally disable watchdogs and make them opt in. We have 'systemd-analyze service-watchdogs',
> and we could make the default configurable to "yes|no".
> 
> What do you think?
> Zbyszek

I think the main issue is the watchdog timeout hasn't been tuned
appropriately for the environment it's being applied to.

It's as if the timeouts are somewhere near the hard real-time
expectations end of the spectrum, while being applied to
non-deterministically delayed and scheduled normal priority userspace
processes.  It's a sort of impedance mismatch.

I /think/ the purpose of the watchdog is to detect when processes are
permanently wedged, capture their state for debugging, and forcefully
unwedge them.

That seems perfectly reasonable, but the timeout heuristic being used,
given our non-deterministic scheduling, should be incredibly long by
default.  It's not the kind of thing you want false positives on; folks
can always shrink the timeout if they find that desirable.

Without having spent much time thinking about this, I'd lean towards
retaining the watchdogs but making their default timeouts so long a
program would have to be wedged for an hour+ before it triggered.

At least that way we preserve the passive information gathering of
serious bugs which might otherwise go unnoticed with background/idle
services, improving debugging substantially, but eliminate the problems
you describe resulting from false positives.
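
For anyone less familiar with the mechanism, here is a rough sketch of
the service side of the watchdog, in Python.  It assumes the documented
notify protocol: the manager exports WATCHDOG_USEC and NOTIFY_SOCKET,
and the service sends "WATCHDOG=1" datagrams, conventionally at half the
timeout.  The function names are mine, and the sketch ignores
WATCHDOG_PID, so treat it as an illustration and not as what
sd_watchdog_enabled() actually does.

```python
import os
import socket


def watchdog_interval_seconds(environ=os.environ):
    """Return half of WATCHDOG_USEC in seconds, or None if no watchdog is set.

    Simplification: the WATCHDOG_PID check done by the real library is skipped.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def sd_notify(message, environ=os.environ):
    """Send one datagram to NOTIFY_SOCKET; return False if it is not set."""
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True
```

A service's main loop would call sd_notify("WATCHDOG=1") at least once
per watchdog_interval_seconds().  With a one-hour timeout that is a
ping every 30 minutes, so only a service that has made no progress for
a full hour gets killed, which is the behaviour I'm arguing for.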

Regards,
Vito Caputo