[systemd-devel] [PATCH v3 2/4] service: add watchdog restart/reboot timeouts

Lennart Poettering lennart at poettering.net
Thu Feb 2 10:42:46 PST 2012


On Thu, 02.02.12 13:07, Michael Olbrich (m.olbrich at pengutronix.de) wrote:

> > Hmm, so I think this should work differently. We already have failure
> > logic for services, and if a service fails to send us WATCHDOG=1 in time
> > we should make use of that. That way, a service that times out during
> > start-up, crashes, exits with a return value != 0, or stops sending
> > watchdog requests would be handled the same way and would take advantage
> > of the normal, already existing restart logic, which has the holdoff
> > time implemented and everything.
> 
> So instead of just restarting the service, we set the state to
> "failed".

Yes, pretty much that. Currently "failed" is a boolean. We should
probably make that an enum though, so that we can distinguish the
failure reasons.

> Then the existing restart-on-failure can kick in. Did I understand that
> correctly?

Yes.

> Still, we need to configure what "a service fails to send us WATCHDOG=1 in
> time" means. A timeout needs to be defined somewhere.
> I'd see this in the service file, or maybe the service sends "WATCHDOG=10s"
> and "in time" means within the next 10 seconds, or something like that?

I think a WatchdogSec= setting would make a ton of sense for this. If set
to 0 (the default) we'd not have any watchdog logic, but if it is set for
a service, the service needs to send a keep-alive message at least that
often.
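
Roughly, as a unit file sketch (WatchdogSec= is only proposed here, and
mydaemon.service is of course made up):

    [Service]
    ExecStart=/usr/bin/mydaemon
    Restart=on-failure
    # proposed: if no WATCHDOG=1 arrives within 30s, mark the service failed
    WatchdogSec=30s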

Hmm, it might also make sense to pass the information that we expect a
watchdog message within a certain interval to the executed processes via
an env var.

Maybe we should just set WATCHDOG_USEC=4711000000 when spawning a process
with WatchdogSec=4711 set? That way, the usual event loops of the various
services could just parse this env var and configure their wakeups
accordingly.
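
On the service side this could look something like the following sketch:
a main loop that parses the (proposed, not yet existing) WATCHDOG_USEC
variable and pings us via sd_notify() from sd-daemon.h; pinging at half
the interval is just one possible policy:

    /* sketch only, not an existing interface */
    #include <stdlib.h>
    #include <unistd.h>
    #include <systemd/sd-daemon.h>

    int main(void) {
            unsigned long long watchdog_usec = 0;
            const char *e = getenv("WATCHDOG_USEC");
            if (e)
                    watchdog_usec = strtoull(e, NULL, 10);

            sd_notify(0, "READY=1");

            for (;;) {
                    /* ... do one iteration of actual work here ... */

                    if (watchdog_usec > 0) {
                            /* ping well before the deadline */
                            sd_notify(0, "WATCHDOG=1");
                            unsigned sec = (unsigned) (watchdog_usec / 2 / 1000000ULL);
                            sleep(sec > 0 ? sec : 1);
                    } else
                            sleep(5);
            }
    }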

> > And then, on top of that, we should have a configurable restart
> > rate limiter that can be set up to either disable restarting when
> > triggered or to trigger a reboot. Putting all of that together should
> > provide more or less what you are asking for, but with the benefit that
> > the holdoff time applies and that we can benefit from the reboot logic
> > also for services that fail for some other reason.
> > 
> > Does that make sense?
> 
> So we try to restart the failed service. If we run into the startup timeout
> we reboot. This should work correctly if the service sends the first
> WATCHDOG=1 with the READY=1.

Well, we'd try to restart a failed service, and if it fails too often in
a certain time frame we'd either just stop restarting it or reboot, using
any of the three reboot actions listed below.

> Still, this depends on a correctly working systemd. What we need is an
> integrity check that is executed before writing the keep-alive to the
> hardware watchdog.

Yes, I think I understand your need. Just haven't wrapped my head around
how this should really work in the end. For now I'd like to see this as
different steps:

a) track the watchdog alive messages (already merged)

b) hook up the watchdog with the existing failure logic and introduce
WatchdogSec= for that.

c) pass watchdog frequency to executed processes via env var

d) replace failure boolean with an enum

e) extend the start logic to do configurable rate limiting of starts,
plus optional reboots when the rate limiter is triggered (see the example
after this list). New options for this:

    StartLimitInterval=5min
    StartLimitBurst=5000
    StartLimitAction=none|reboot|reboot-force|reboot-immediate

f) hook up /dev/watchdog with all of this
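
To illustrate e), a hypothetical unit putting the proposed options
together (option names as sketched above, values picked arbitrarily,
none of this exists yet): if the service needs to be restarted more than
5 times within 5 minutes, give up restarting and reboot instead.

    [Service]
    ExecStart=/usr/bin/mydaemon
    WatchdogSec=30s
    Restart=on-failure
    StartLimitInterval=5min
    StartLimitBurst=5
    StartLimitAction=reboot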

Of these, b) and c) should be fairly easy. I will implement d) in the
next half hour. e) is fairly easy too, since we already have ratelimit.c
which can be trivially reused, I guess, and we actually already have a
static rate limiter for this built in, so this is mostly about adding
configurability and the reboot options to it. For f) I am not entirely
sure how this should look in the end; I need to think about this more.
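
For d), I'd expect the boolean to turn into something along these lines
(just a sketch, the actual enumerators remain to be seen):

    typedef enum ServiceResult {
            SERVICE_SUCCESS,
            SERVICE_FAILURE_EXIT_CODE,   /* exited with status != 0 */
            SERVICE_FAILURE_SIGNAL,      /* killed by a signal */
            SERVICE_FAILURE_TIMEOUT,     /* start/stop operation timed out */
            SERVICE_FAILURE_WATCHDOG,    /* missed its WATCHDOG=1 deadline */
            _SERVICE_RESULT_MAX
    } ServiceResult;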

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

