[systemd-devel] [PATCH v3 2/4] service: add watchdog restart/reboot timeouts

Wed Feb 1 11:11:28 PST 2012

On Wed, Feb 01, 2012 at 07:42:16PM +0100, Lennart Poettering wrote:
> On Wed, 01.02.12 17:17, Michael Olbrich (m.olbrich at pengutronix.de) wrote:
> 
> > This patch adds the WatchdogRestartSec and WatchdogRebootSec
> > properties to services. Systemd will restart the service / reboot the
> > system if the watchdog timeout has not been updated for the configured
> > amount of time.
> > This functionality is only enabled if the watchdog timeout is set at
> > least once.
> 
> Do we really need two timers for this? To me it appears more natural to
> introduce one timer, and one option to configure what should happen if
> the timeout is reached, simply because we might end up with more than
> just two options here (in fact, already I figure we need four...):
> 
> WatchdogSec=...
> WatchdogAction=restart|reboot|reboot-force|reboot-immediate
> 
> Where:
> 
> restart = restart the service
> reboot = reboot the machine cleanly, i.e. with shutting down all
>          services and unmounting/syncing all disks, via initrd if used
> reboot-force = reboot the machine semi-cleanly, i.e. don't shut down
>                anything, but still unmount/sync disks and initrd
> reboot-immediate = reboot the machine immediately, i.e. don't do
>                    anything except call the reboot() system call right away.
> 
> Does that make sense? Or can you make a case for having individual
> timeouts for these?

The problem is that restart and reboot are used to recover from two
different error sources. Restart is for errors in the service itself. Reboot
is for recovering from problems outside the service.
The idea is to have multiple escalation layers.
Let's say the basic error is that reading from a block device blocks
forever (hardware or driver issue):
- First we try to restart the application. This will fail again.
- Next we try to reboot. This might work if the block device is not needed
  to shut down the system.
- If systemd is affected as well, then at some point the hardware watchdog
  triggers. Hopefully the block device will work again after rebooting.

If just one action is required, that can be achieved with the appropriate
timeouts.
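
As a rough illustration of the escalation, here is a small Python sketch (not
the C code from the patch). The function name watchdog_action is made up for
this example, and two things are assumptions: that a timeout of 0 disables a
layer, and that the time since the last watchdog update keeps counting across
a service restart.

```python
from typing import Optional


def watchdog_action(elapsed_sec: float,
                    restart_sec: float,
                    reboot_sec: float) -> Optional[str]:
    """Pick the escalation step for a service whose last watchdog update
    was elapsed_sec seconds ago.

    restart_sec / reboot_sec stand in for WatchdogRestartSec and
    WatchdogRebootSec. Treating 0 as "layer disabled" is an assumption
    made for this sketch only.
    """
    # Check the most severe layer first, so that a long silence
    # escalates past a restart that did not help.
    if reboot_sec > 0 and elapsed_sec >= reboot_sec:
        return "reboot"
    if restart_sec > 0 and elapsed_sec >= restart_sec:
        return "restart"
    # The watchdog was updated recently enough: nothing to do.
    return None


# Two layers: restart after 30s of silence, reboot after 120s.
for t in (10, 30, 120):
    print(t, watchdog_action(t, restart_sec=30, reboot_sec=120))

# A single action: only the reboot layer is configured.
print(watchdog_action(60, restart_sec=0, reboot_sec=60))
```

The hardware watchdog is the last layer and sits outside this logic: it only
triggers when systemd itself stops responding.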

Michael

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |