[systemd-devel] [PATCH v3 2/4] service: add watchdog restart/reboot timeouts

Wed Feb 1 14:26:09 PST 2012

On Wed, 01.02.12 20:11, Michael Olbrich (m.olbrich at pengutronix.de) wrote:

> 
> On Wed, Feb 01, 2012 at 07:42:16PM +0100, Lennart Poettering wrote:
> > On Wed, 01.02.12 17:17, Michael Olbrich (m.olbrich at pengutronix.de) wrote:
> > 
> > > This patch adds the WatchdogRestartSec and WatchdogRebootSec
> > > properties to services. Systemd will restart the service / reboot the
> > > system if the watchdog timeout has not been updated for the configured
> > > amount of time.
> > > This functionality is only enabled if the watchdog timeout is set at
> > > least once.
> > 
> > Do we really need two timers for this? To me it appears more natural to
> > introduce one timer, and one option to configure what should happen if
> > the timeout is reached, simply because we might end up with more than
> > just two options here (in fact, already I figure we need four...):
> > 
> > WatchdogSec=...
> > WatchdogAction=restart|reboot|reboot-force|reboot-immediate
> > 
> > Where:
> > 
> > restart = restart the service
> > reboot = reboot the machine cleanly, i.e. by shutting down all
> >          services and unmounting/syncing all disks, via initrd if used
> > reboot-force = reboot the machine semi-cleanly, i.e. don't shut down
> >                anything, but still unmount/sync disks and initrd
> > reboot-immediate = reboot the machine immediately, i.e. don't do
> >                    anything but call the reboot() system call right away.
> > 
> > Does that make sense? Or can you make a case for having individual
> > timeouts for these?
> 
> The problem is that restart and reboot are used to recover from two
> different error sources. Restart is for errors in the service itself. Reboot
> is to recover from outside problems.
> The idea is to have multiple escalation layers:
> Let's say the basic error is that reading from a block device will block
> forever (hardware or driver issue).
> - First we try to restart the application. This will fail again.
> - Next we try to reboot. This might work if the block device is not needed
>   to shut down the system.
> - If systemd is affected as well, then at some point the hardware watchdog
>   triggers. Hopefully after rebooting the block device will work again.
> 
> If just one action is required, that can be achieved with the appropriate
> timeouts.

Hmm, so I think this should work differently. We already have a failure
logic for services, and if a service fails to send us WATCHDOG=1 in time
we should make use of that. That way, a service whose start-up times
out, that crashes, that exits with a return value != 0, or that doesn't
send watchdog requests will be handled the same way and will take
advantage of the normal, already existing restart logic, which has the
holdoff time implemented and everything.
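
To illustrate the idea, here is a rough Python model (not systemd code)
of a missed WATCHDOG=1 being treated as just another failure cause that
feeds the same restart path. All names in it (Failure, detect_failure,
next_restart_time, holdoff_sec) are made up for this sketch. It keeps
the rule from the patch description that the watchdog is only armed once
the service has sent at least one keep-alive.

```python
from enum import Enum


class Failure(Enum):
    """Hypothetical failure causes; all of them share one restart path."""
    START_TIMEOUT = "start-timeout"
    CRASH = "crash"
    EXIT_CODE = "exit-code"
    WATCHDOG = "watchdog"


def detect_failure(*, started_in_time=True, killed_by_signal=False,
                   exit_code=None, last_ping=None, watchdog_sec=None,
                   now=0.0):
    """Return the first matching failure cause, or None if healthy.

    exit_code is None while the service is still running.
    last_ping is None until the service has sent its first WATCHDOG=1,
    in which case the watchdog check is not armed.
    """
    if not started_in_time:
        return Failure.START_TIMEOUT
    if killed_by_signal:
        return Failure.CRASH
    if exit_code is not None and exit_code != 0:
        return Failure.EXIT_CODE
    if (watchdog_sec is not None and last_ping is not None
            and now - last_ping > watchdog_sec):
        return Failure.WATCHDOG
    return None


def next_restart_time(failure, now, holdoff_sec):
    """Every failure cause gets the same treatment: restart after holdoff."""
    if failure is None:
        return None
    return now + holdoff_sec
```

The point of the sketch is that next_restart_time() never looks at which
cause fired; the watchdog case needs no restart logic of its own.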

And then, on top of that we should have a configurable restart
rate limiter that, when triggered, can be configured either to disable
restarting or to trigger a reboot. Putting all that together we should
provide more or less what you are asking for, with the added benefit
that the holdoff time applies and that the reboot logic is also
available to services that fail for some other reason.
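
A rough Python model of such a rate limiter, under assumed semantics:
at most `burst` restarts are allowed within a sliding window of
`interval` seconds, and once that limit is hit the configured action
("stop" to disable restarting, or "reboot") is returned instead. The
class name, parameter names and action strings are all hypothetical and
only serve the illustration.

```python
from collections import deque


class RestartLimiter:
    """Sliding-window restart limiter with a configurable escalation."""

    def __init__(self, interval, burst, action="stop"):
        if action not in ("stop", "reboot"):
            raise ValueError("action must be 'stop' or 'reboot'")
        self.interval = interval  # window length in seconds
        self.burst = burst        # restarts allowed per window
        self.action = action      # what to do once the limit is hit
        self._restarts = deque()  # timestamps of recent restarts

    def on_failure(self, now):
        """Decide what to do about a service failure observed at `now`."""
        # Forget restarts that have fallen out of the window.
        while self._restarts and now - self._restarts[0] >= self.interval:
            self._restarts.popleft()
        if len(self._restarts) >= self.burst:
            return self.action
        self._restarts.append(now)
        return "restart"
```

With RestartLimiter(interval=10, burst=2, action="reboot"), failures at
t=0 and t=1 each yield "restart", and a third failure at t=2 yields
"reboot". Because the limiter only counts failures and does not care
about their cause, a service that keeps crashing escalates the same way
as one that keeps missing its watchdog deadline.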

Does that make sense?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.