[systemd-devel] [PATCH 0/3] make the service property StartLimitAction writeable

Tue Apr 17 02:16:06 PDT 2012

Hi,

On Tue, Apr 10, 2012 at 10:59:10PM +0200, Lennart Poettering wrote:
> On Fri, 06.04.12 21:37, Michael Olbrich (m.olbrich at pengutronix.de) wrote:
> Hmm, so, from your original watchdog work, what is still missing in git?
> You had some code that "multiplexed" the hw watchdog for individual
> services, but I couldn't wrap my head around it. Was there anything else
> left? (Or anything still on your wishlist?)
> 
> The multiplexing I am still not convinced of, btw. With all the code
> now in place we can soft-reboot the machine when a specific service
> doesn't react, and hard-reboot the machine when systemd doesn't
> react. With the multiplexing in place we'd simply forward the service
> watchdog events to the hw watchdog, but what precisely would we gain
> from that? I mean, we already can reboot the machine directly anyway if
> a service doesn't respond, why add the (potentially fragile) indirection
> to do the same via the hw watchdog timer? Or in other words, why would
> somebody choose to make use of this hw watchdog indirection rather than
> just tell systemd "StartLimitAction=reboot-immediate"? Can you explain?
> What am I missing?

Well, all that code is from before we implemented StartLimitAction=, and I
think most use cases are covered. But unless I missed something, start
limits cannot be used to reboot immediately without restarting the service
at least once. I've not found a good way to implement this on either the
configuration or the implementation side.
Other than that, I think the service API has everything I need.

There is however another concern. While the start limit API is really nice
and powerful, it is also complex. And so is the rest of systemd. And as we
all know, every non-trivial piece of software has bugs. :-)
So, what if systemd gets it wrong? Maybe a bug in the logic. Or memory
corruption caused by systemd itself or even the kernel.
The problem with watchdog implementations is that you always have to
consider the worst case.

So I've been thinking about doing some integrity checking before calling
watchdog_ping(). Some things we could check:
- Lists: self->next->prev == self is always a good test
- Range checks for enums
I'm sure there is more.
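
To make the two checks above concrete, here is a rough sketch. The struct,
the enum and all function names are made up for illustration and are not
systemd's actual types; the idea would be to ping the hardware watchdog
only if checks like these pass.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical doubly linked list node; not systemd's actual list type. */
struct node {
    struct node *next;
    struct node *prev;
};

/* Hypothetical enum with a sentinel marking the end of the valid range. */
enum action {
    ACTION_NONE,
    ACTION_REBOOT,
    ACTION_REBOOT_IMMEDIATE,
    ACTION_MAX
};

/* A node is consistent if each existing neighbour points back at it. */
static bool node_is_consistent(const struct node *self) {
    if (!self)
        return false;
    if (self->next && self->next->prev != self)
        return false;
    if (self->prev && self->prev->next != self)
        return false;
    return true;
}

/* Walk the list from head. The bound keeps a corrupted cycle from
 * hanging the check itself. */
static bool list_is_consistent(const struct node *head, size_t max_nodes) {
    size_t n = 0;

    for (const struct node *i = head; i; i = i->next) {
        if (++n > max_nodes || !node_is_consistent(i))
            return false;
    }
    return true;
}

/* Range check for a value that is supposed to hold an enum action. */
static bool action_is_valid(int value) {
    return value >= 0 && value < ACTION_MAX;
}
```

An empty list passes trivially, and max_nodes would have to be chosen
generously enough not to trip on a legitimately long list.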

When I first experimented with this, my biggest concern was the service
watchdog, so I tried to double check it. Unfortunately I soon found out
that determining if StartLimitAction= should have been executed already is
non-trivial. This is one of the reasons why I'd like a way to trigger
StartLimitAction= immediately. In that case an expired watchdog
(+RestartSec) means we have a problem.
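
With an immediate StartLimitAction=, the double check could be as simple as
the following sketch. Again, the struct and its field names are assumptions
for illustration only, not the fields systemd really uses, and overflow of
the timestamps is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t usec_t; /* microseconds; defined here for the sketch */

/* Hypothetical snapshot of the watchdog state of one service. */
struct service_watchdog {
    usec_t last_ping;     /* time of the last keep-alive from the service */
    usec_t watchdog_usec; /* service watchdog timeout, 0 = disabled */
    usec_t restart_usec;  /* RestartSec= */
};

/* True if the service watchdog expired more than RestartSec ago. If the
 * action is supposed to trigger immediately and we still get here, the
 * service watchdog logic has failed. */
static bool service_watchdog_overdue(const struct service_watchdog *s, usec_t now) {
    if (s->watchdog_usec == 0)
        return false;
    return now > s->last_ping + s->watchdog_usec + s->restart_usec;
}
```

If this returns true for any service, the hardware watchdog would simply not
be pinged any more.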

Would you accept patches for something like this?

> (as a side note: i submitted a little tool to util-linux which queries
> /dev/watchdog for its state and is useful to figure out what watchdog is
> available and what its capabilities are. I hope this is merged
> soon. This should be useful for everybody experimenting with hardware
> watchdogs...)

Sounds interesting. Do you have a link for this?

Regards,
Michael

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |