[systemd-devel] is the watchdog useful?

Tue Oct 22 09:22:32 UTC 2019

On Mon, Oct 21, 2019 at 02:32:08PM -0700, Vito Caputo wrote:
> On Mon, Oct 21, 2019 at 05:50:44PM +0000, Zbigniew Jędrzejewski-Szmek wrote:
> > In principle, the watchdog for services is nice. But in practice it seems
> > to bring only grief. The Fedora bugtracker is full of automated reports of ABRTs,
> > and of those that were fired by the watchdog, pretty much 100% are bogus, in
> > the sense that the machine was resource starved and the watchdog fired.
> > 
> > There are a few downsides to the watchdog killing the service:
> > 1. if it is something like logind, it is possible that it will cause user-visible
> > failure of other services
> > 2. restarting of the service causes additional load on the machine
> > 3. coredump handling causes additional load on the machine, quite significant
> > 4. those failures are reported in bugtrackers and waste everyone's time.
> > 
> > I had the following ideas:
> > 1. disable coredumps for watchdog abrts: systemd could set some flag
> > on the unit or otherwise notify systemd-coredump about this, and it could just
> > log the occurrence but not dump the core file.
> > 2. generally disable watchdogs and make them opt in. We have 'systemd-analyze service-watchdogs',
> > and we could make the default configurable to "yes|no".
> > 
> > What do you think?
> > Zbyszek
> 
> 
> I think the main issue is that the watchdog timeout hasn't been tuned
> appropriately for the environment it's being applied to.
> 
> It's as if the timeouts are somewhere near the hard real-time
> expectations end of the spectrum, while being applied to
> non-deterministically delayed and scheduled normal priority userspace
> processes.  It's a sort of impedance mismatch.
> 
> I /think/ the purpose of the watchdog is to detect when processes are
> permanently wedged, capture their state for debugging, and forcefully
> unwedge them.
> 
> That seems perfectly reasonable, but the timeout heuristic being used,
> given our non-deterministic scheduling, should be incredibly long by
> default.  It's not the kind of thing you want false positives on; folks
> can always shrink the timeout if they find it's desirable.

The timeout is currently 3 minutes in all of systemd's own units that use
the watchdog. Dunno, maybe we should make that 30 minutes.
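
For reference, here is a rough sketch of the service side of this, to show
what the timeout actually governs. It assumes the usual sd_notify protocol:
the manager exports the timeout as WATCHDOG_USEC (microseconds) and expects
"WATCHDOG=1" datagrams on $NOTIFY_SOCKET, with a ping at half the timeout
being the commonly recommended interval. The helper names
watchdog_interval_seconds() and notify() are made up for illustration; this
is not how any particular systemd daemon implements it.

```python
import os
import socket


def watchdog_interval_seconds(environ=os.environ):
    """Return the keep-alive interval in seconds, or None if no watchdog is set.

    WATCHDOG_USEC carries the unit's watchdog timeout in microseconds;
    pinging at half that period leaves headroom for scheduling delays.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1_000_000 / 2


def notify(message, environ=os.environ):
    """Send one datagram to $NOTIFY_SOCKET.

    Returns False when NOTIFY_SOCKET is unset (not started by the manager).
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes an abstract namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


if __name__ == "__main__":
    interval = watchdog_interval_seconds()
    if interval is not None:
        # A real service would call this from its main loop every
        # `interval` seconds, only while it is making progress.
        notify("WATCHDOG=1")
```

With the current 3 minutes the service sees WATCHDOG_USEC=180000000 and
would ping every 90 seconds; with 30 minutes that becomes every 900 seconds,
so a starved machine gets far more slack before the watchdog fires.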

Zbyszek

> Without having spent much time thinking about this, I'd lean towards
> retaining the watchdogs but making their default timeouts so long a
> program would have to be wedged for an hour+ before it triggered.
> 
> At least that way we preserve the passive information gathering of
> serious bugs which might otherwise go unnoticed with background/idle
> services, improving debugging substantially, but eliminate the problems
> you describe resulting from false positives.