[systemd-devel] [PATCH v3 2/4] service: add watchdog restart/reboot timeouts

Thu Feb 2 04:07:23 PST 2012

On Wed, Feb 01, 2012 at 11:26:09PM +0100, Lennart Poettering wrote:
> On Wed, 01.02.12 20:11, Michael Olbrich (m.olbrich at pengutronix.de) wrote:
> 
> > 
> > On Wed, Feb 01, 2012 at 07:42:16PM +0100, Lennart Poettering wrote:
> > > On Wed, 01.02.12 17:17, Michael Olbrich (m.olbrich at pengutronix.de) wrote:
> > > 
> > > > This patch adds the WatchdogRestartSec and WatchdogRebootSec
> > > > properties to services. Systemd will restart the service / reboot the
> > > > system if the watchdog timeout has not been updated for the configured
> > > > amount of time.
> > > > This functionality is only enabled if the watchdog timeout is set at
> > > > least once.
> > > 
> > > Do we really need two timers for this? To me it appears more natural to
> > > introduce one timer, and one option to configure what should happen if
> > > the timeout is reached, simply because we might end up with more than
> > > just two options here (in fact, already I figure we need four...):
> > > 
> > > WatchdogSec=...
> > > WatchdogAction=restart|reboot|reboot-force|reboot-immediate
> > > 
> > > Where:
> > > 
> > > restart = restart the service
> > > reboot = reboot the machine cleanly, i.e. with shutting down all
> > >          services and unmounting/syncing all disks, via initrd if used
> > > reboot-force = reboot the machine semi-cleanly, i.e. don't shut down
> > >                anything, but still unmount/sync disks and initrd
> > > reboot-immediate = reboot the machine immediately, i.e. don't do
> > >                    anything but call the reboot() system call right away.
> > > 
> > > Does that make sense? Or can you make a case for having individual
> > > timeouts for these?
> > 
> > The problem is that restart and reboot are used to recover from two
> > different error sources. Restart is for errors in the service itself. Reboot
> > is to recover from outside problems.
> > The idea is to have multiple escalation layers:
> > Let's say the basic error is that reading from a block device will block
> > forever (hardware or driver issue).
> > - First we try to restart the application. This will fail again.
> > - Next we try to reboot. This might work if the block device is not needed
> >   to shut down the system.
> > - If systemd is affected as well, then at some point the hardware watchdog
> >   triggers. Hopefully after rebooting the block device will work again.
> > 
> > If just one action is required, that can be achieved with the appropriate
> > timeouts.
> 
> Hmm, so I think this should work differently. We already have a failure
> logic for services, and if a service fails to send us WATCHDOG=1 in time
> we should make use of that. That way, a service whose start-up times out,
> that crashes or exits with a return value != 0, or that doesn't send
> watchdog requests will be handled the same way and would take advantage
> of the normal, already existing restart logic, which has the holdoff time
> implemented and everything.

So instead of just restarting the service, we set the state to "failed".
Then the existing restart-on-failure can kick in. Did I understand that
correctly?
Still, we need to configure what "a service fails to send us WATCHDOG=1 in
time" means. A timeout needs to be defined somewhere.
I see this in the service file. Or maybe the service sends "WATCHDOG=10s"
and "in time" means within the next 10 seconds, or something like that?
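
To make the question concrete, here is a minimal sketch in Python of one
possible meaning of "in time". The class and method names are made up for
illustration and this is not systemd code. It assumes a fixed timeout taken
from the service file, that checking only starts once the first WATCHDOG=1
has arrived, and that a missed deadline puts the service into "failed" so the
existing restart-on-failure logic can take over.

```python
from typing import Optional


class WatchdogTracker:
    """Hypothetical per-service watchdog state (illustration only)."""

    def __init__(self, timeout_sec: float) -> None:
        # Assumed to come from the service file.
        self.timeout_sec = timeout_sec
        # Timestamp of the last WATCHDOG=1; None until the first one arrives.
        self.last_ping: Optional[float] = None

    def ping(self, now: float) -> None:
        """Record that the service sent WATCHDOG=1 at time `now`."""
        self.last_ping = now

    def state(self, now: float) -> str:
        """Return "failed" if the last ping is older than the timeout."""
        if self.last_ping is None:
            # Watchdog not armed yet: the service never sent WATCHDOG=1.
            return "running"
        if now - self.last_ping > self.timeout_sec:
            return "failed"
        return "running"
```

With a 10 second timeout, a service that last pinged at t=100 would still
count as running at t=110 and as failed at any later time.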

> And then, on top of that we should have a configurable restart
> ratelimiter, that can be configured to disable restarting if triggered,
> or trigger a reboot. Putting all that together we should provide more or
> less what you are asking for but also have the benefit that the hold off
> time applies and that we can benefit from the reboot logic also for
> services that fail due to some other reason.
> 
> Does that make sense?

So we try to restart the failed service. If we run into the startup timeout
we reboot. This should work correctly if the service sends the first
WATCHDOG=1 with the READY=1.
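
On the service side this could look roughly like the following Python sketch.
It assumes the usual notification mechanism, i.e. a datagram sent to the
AF_UNIX socket named in $NOTIFY_SOCKET with newline-separated assignments.
The helper names notify() and startup_message() are made up for this example.

```python
import os
import socket


def notify(state: str) -> bool:
    """Send a notification datagram to the socket in $NOTIFY_SOCKET.

    Returns False if no notification socket is configured.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def startup_message() -> str:
    """Announce readiness and send the first keep-alive in one datagram."""
    return "READY=1\nWATCHDOG=1"
```

The service would call notify(startup_message()) once start-up is complete
and notify("WATCHDOG=1") periodically after that.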

Still, this depends on a correctly working systemd. What we need is an
integrity check that is executed before writing the keep-alive to the
hardware watchdog.
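
Roughly what I have in mind, as a Python sketch only. The function name and
the idea of passing the checks in as callables are assumptions made for
illustration. On Linux, keep_alive would typically be a write to
/dev/watchdog, but here it is just a callback.

```python
from typing import Callable, Iterable


def pet_if_healthy(
    checks: Iterable[Callable[[], bool]],
    keep_alive: Callable[[], None],
) -> bool:
    """Send the hardware watchdog keep-alive only if every check passes.

    Returns True if the keep-alive was sent, False otherwise.
    """
    for check in checks:
        try:
            ok = check()
        except Exception:
            # A check that raises counts as a failed check.
            ok = False
        if not ok:
            # Skip the keep-alive so the hardware watchdog can expire.
            return False
    keep_alive()
    return True
```

If any check fails, the keep-alive is simply withheld and the hardware
watchdog resets the machine once its own timeout runs out.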

Michael

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |