[systemd-devel] Later activation of the HW watchdog

Tue Oct 24 15:10:39 UTC 2017

Hi,
is it possible to change systemd's global settings for RuntimeWatchdogSec 
at runtime? I would like to have the early boot "guarded" by the HW 
watchdog started by my platform code, and for systemd to take over only 
after a certain target has been reached. I was thinking about an extra unit 
which simply writes an appropriate config file, but the docs for `systemctl 
daemon-reload` or `daemon-reexec` do not talk about these top-level 
settings. How do I tell systemd to notice a new value?

Context: I'm using systemd on an embedded ARM box with reliable network 
connectivity. The system has two fully separate rootfs/kernel/devicetree 
instances, A and B. The bootloader starts a HW watchdog timer, and the 
bootloader keeps a counter tracking how many times a particular A/B 
"boot slot" has attempted to boot. The kernel ignores the watchdog, and once 
systemd gets launched and checks its system.conf file, it proceeds to 
re-start the WD timer periodically. Finally, a unit which is pulled in by 
my default target updates the bootloader's environment, resetting the boot 
counter.

My goal is to be able to boot a possibly broken image (but not a malicious 
one, of course) without fearing that it's going to lock me out of my 
device. If the new image "fails" for some reason, I expect the HW watchdog 
to reset the system, the boot attempt counter to eventually reach zero, and 
the whole system to roll back to the previous image. In my 
scenario, it's preferable to reboot rather than wait 
for human interaction to solve the actual problem. The once-failed slot 
can be re-flashed very cheaply, and an updated version can be re-tried 
during the next update attempt.
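
To make the expected rollback behaviour concrete, here is a minimal Python 
sketch of the slot selection. It is only a model: the real logic lives in the 
bootloader and its environment, and the names `Slot`, `pick_slot`, 
`mark_good` and the limit of three attempts are hypothetical.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumption: the real limit is a bootloader setting


@dataclass
class Slot:
    name: str
    attempts_left: int  # remaining boot attempts for this slot


def pick_slot(preferred: Slot, fallback: Slot) -> Slot:
    """Bootloader side: use the preferred slot while it has attempts left."""
    if preferred.attempts_left > 0:
        # Count the attempt *before* booting, so a hang or a watchdog
        # reset still consumes one attempt.
        preferred.attempts_left -= 1
        return preferred
    # Attempts exhausted: roll back to the previous image.
    return fallback


def mark_good(slot: Slot) -> None:
    """Userspace side: what the unit pulled in by the default target does."""
    slot.attempts_left = MAX_ATTEMPTS
```

If every boot of the new slot ends in a watchdog reset, `mark_good` never 
runs, the counter reaches zero, and the next `pick_slot` call returns the 
old slot.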

During my testing, I was able to unplug the system's SD card at a "wrong" 
moment which resulted in systemd trying to boot into emergency.target and 
ultimately failing due to a missing rootfs. I ended up with an unusable 
system which did not reboot automatically because systemd was periodically 
pinging the HW watchdog timer. [1]

I got a suggestion to adjust the important units so that they specify a 
FailureAction. I do not like that solution because it is additional work 
(identifying which units might fail, coming up with the various possible 
failure scenarios, and keeping it tested and "right" in the face of future 
systemd updates). It also feels like I am attacking the wrong problem. 
I already *have* a watchdog which will shoot the system in the head if 
something goes wrong. Wouldn't it make more sense to rely on this piece 
of infrastructure and start telling the watchdog "hey, I'm OK" only after 
the system has fully booted and my ultimate target has been *reached*?
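
To illustrate the behaviour I am after, here is a minimal Python sketch of a 
separate process that leaves the watchdog alone until a target is active and 
only then starts pinging it. This is not something systemd provides; it 
assumes RuntimeWatchdogSec is left unset so that systemd does not open the 
device itself, and the target name, the intervals and the function names are 
placeholders of mine.

```python
import subprocess
import time

WATCHDOG_DEVICE = "/dev/watchdog"   # standard Linux watchdog device node
READY_TARGET = "multi-user.target"  # assumption: stand-in for my real target
PING_INTERVAL_SEC = 5               # assumption: must be below the HW timeout


def target_reached(target: str = READY_TARGET) -> bool:
    """Return True once systemd reports the target as active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", target], check=False
    )
    return result.returncode == 0


def wait_then_feed(dev, is_ready, interval, sleep=time.sleep,
                   poll_interval=1, max_pings=None):
    """Leave the watchdog untouched until is_ready() is true, then ping it.

    Returns the number of pings sent; max_pings=None means ping forever.
    """
    # Until the target is reached nothing restarts the timer, so a hung
    # boot is ended by the timeout the bootloader armed.
    while not is_ready():
        sleep(poll_interval)
    sent = 0
    while max_pings is None or sent < max_pings:
        dev.write(b"\0")  # any write to the device restarts the HW timer
        dev.flush()
        sent += 1
        sleep(interval)
    return sent
```

A service would open the device with 
`open(WATCHDOG_DEVICE, "wb", buffering=0)` and pass it to 
`wait_then_feed(dev, target_reached, PING_INTERVAL_SEC)`.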

Suggestions which offer additional possibilities are welcome. I like 
systemd's feature set, and I won't pretend that I know all of it :).

With kind regards,
Jan

[1] https://github.com/systemd/systemd/issues/7063