[systemd-devel] Later activation of the HW watchdog
Jan Kundrát
jan.kundrat at cesnet.cz
Tue Oct 24 15:10:39 UTC 2017
Hi,
is it possible to change systemd's global settings for RuntimeWatchdogSec
at runtime? I would like to have the early boot "guarded" by the HW
watchdog started by my platform code, and for systemd to take over only
after a certain target has been reached. I was thinking about an extra unit
which simply writes an appropriate config file, but the docs for
`systemctl daemon-reload` or `daemon-reexec` do not talk about these
top-level settings. How do I tell systemd to notice a new value?
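A minimal sketch of the "extra unit writes a config file" idea, in Python. It assumes a drop-in file under a `system.conf.d` directory with a `[Manager]` section; the file name `50-watchdog.conf`, the helper names, and the 30 second timeout are hypothetical. It only writes the file and does not answer whether systemd notices the new value at runtime, which is the open question here.

```python
from pathlib import Path


def render_watchdog_dropin(timeout_sec: int) -> str:
    """Return the text of a drop-in that sets RuntimeWatchdogSec."""
    if timeout_sec <= 0:
        raise ValueError("timeout must be positive")
    return f"[Manager]\nRuntimeWatchdogSec={timeout_sec}s\n"


def write_watchdog_dropin(conf_dir: Path, timeout_sec: int,
                          name: str = "50-watchdog.conf") -> Path:
    """Write the drop-in into conf_dir and return its path.

    The file name is a hypothetical choice. The content is written to a
    temporary file first and then renamed, so a reader never sees a
    half-written config.
    """
    conf_dir.mkdir(parents=True, exist_ok=True)
    target = conf_dir / name
    tmp = target.with_suffix(".tmp")
    tmp.write_text(render_watchdog_dropin(timeout_sec))
    tmp.replace(target)
    return target
```

On a real system the directory would presumably be `/etc/systemd/system.conf.d`, and the unit running this would be ordered after the target of interest.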
Context: I'm using systemd on an embedded ARM box with reliable network
connectivity. The system has two fully separate rootfs/kernel/devicetree
instances, A and B. The bootloader starts a HW watchdog timer, and the
bootloader keeps a counter tracking how many times a particular A/B
"boot slot" attempted to boot. The kernel ignores the watchdog, and once
systemd gets launched and checks its system.conf file, it proceeds to
re-start the WD timer periodically. Finally, a unit which is pulled in by
my default target updates the bootloader's environment, resetting the boot
counter.
My goal is to be able to boot a possibly broken image (but not a malicious
one, of course) without fearing that it's going to lock me out of my
device. If the new image "fails" for some reason, I expect the HW watchdog
to reset the system, the boot attempt counter to eventually reach zero, and
the whole system to roll back to the previous image. In my scenario, it is
preferable to reboot automatically rather than wait for a human to solve
the actual problem. The once-failed slot can be re-flashed very cheaply,
and an updated version can be retried during the next update attempt.
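The rollback behaviour described above can be modelled roughly as follows. This is only a sketch of the idea, not the actual bootloader logic: the `BootEnv` structure, the function names, and the limit of 3 attempts per slot are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

MAX_ATTEMPTS = 3  # hypothetical limit; the real value is bootloader-specific


@dataclass
class BootEnv:
    """Hypothetical model of the bootloader environment."""
    preferred: str = "A"
    attempts_left: Dict[str, int] = field(
        default_factory=lambda: {"A": MAX_ATTEMPTS, "B": MAX_ATTEMPTS})


def other(slot: str) -> str:
    return "B" if slot == "A" else "A"


def choose_slot(env: BootEnv) -> Optional[str]:
    """Bootloader step: pick a slot and consume one boot attempt.

    Tries the preferred slot first and falls back to the other one
    once the preferred slot has no attempts left.
    """
    for slot in (env.preferred, other(env.preferred)):
        if env.attempts_left[slot] > 0:
            env.attempts_left[slot] -= 1
            return slot
    return None  # both slots exhausted


def mark_boot_ok(env: BootEnv, slot: str) -> None:
    """Userspace step: a unit in the default target resets the counter."""
    env.attempts_left[slot] = MAX_ATTEMPTS
```

With a broken image in slot A that never reaches `mark_boot_ok`, four consecutive calls to `choose_slot` return A, A, A, and then B, which is the rollback. This only works if every failed boot actually ends in a reset.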
During my testing, I was able to unplug the system's SD card at a "wrong"
moment which resulted in systemd trying to boot into emergency.target and
ultimately failing due to a missing rootfs. I ended up with an unusable
system which did not reboot automatically because systemd was periodically
pinging the HW watchdog timer. [1]
I got a suggestion to adjust the important units so that they specify a
FailureAction. I do not like that solution because it is additional work
(identifying which units might fail, coming up with various possible
failure scenarios, and so on), and it is hard to test and to get "right"
in the face of future systemd updates. It also feels like I am attacking
the wrong problem.
I already *have* a watchdog which will shoot the system in the head if
something goes wrong. Wouldn't it make more sense to rely on this piece
of infrastructure and start telling the watchdog "hey, I'm OK" only after
the system has fully booted and my ultimate target has been *reached*?
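To make the difference concrete, here is a toy timeline model of a HW watchdog. It is a simplification under stated assumptions: the watchdog is armed at t=0 with a fixed timeout, every ping restarts the full timeout, and all numbers (60 s timeout, pings every 30 s) are made up for illustration.

```python
from typing import Iterable, Optional


def watchdog_fires_at(timeout: int, ping_times: Iterable[int],
                      horizon: int) -> Optional[int]:
    """Return the time the watchdog resets the system, or None.

    The watchdog is armed at t=0. Each ping that arrives before the
    current deadline moves the deadline to ping time + timeout. None
    means no reset happens within the observed horizon.
    """
    deadline = timeout
    for t in sorted(ping_times):
        if t >= deadline:
            break  # too late, the reset has already happened
        deadline = t + timeout
    return deadline if deadline <= horizon else None


# Boot is stuck in emergency.target, but PID 1 pings from early boot on:
early = watchdog_fires_at(60, range(5, 600, 30), horizon=600)

# Same stuck boot, but pinging would start only once the final target is
# reached, which never happens, so there are no pings at all:
deferred = watchdog_fires_at(60, [], horizon=600)
```

In this model `early` is `None` (the box stays stuck for the whole 600 s window) while `deferred` is `60` (the watchdog resets the box after one timeout), which is the behaviour I am after.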
Suggestions which offer additional possibilities are welcome. I like
systemd's feature set, and I won't pretend that I know all of its
features :).
With kind regards,
Jan
[1] https://github.com/systemd/systemd/issues/7063