[systemd-devel] offline updates
Zbigniew Jędrzejewski-Szmek
zbyszek at in.waw.pl
Mon Jul 20 20:27:49 PDT 2015
[resending with the right systemd-devel address, sorry for that]
Here are some thoughts on offline updates resulting from testing
the new dnf fedup plugin developed by Will Woods
[https://github.com/wgwoods/dnf-plugin-fedup].
I ran an update using dnf fedup and it works (or would have worked, if
stuff didn't happen), which is already great for something so simple,
but it exposes some shortcomings in the Offline Update spec itself
[http://www.freedesktop.org/wiki/Software/systemd/SystemUpdates/].
The main issues are:
- what happens when multiple offline mechanisms are present
- how is failure handled
On my test system, I had packagekit-offline-update.service already
present when I installed the plugin and fedup-system-upgrade.service.
After running 'dnf fedup download ...' and 'dnf fedup reboot'
I saw something like this:
Jul 20 21:54:55 fedora22 systemd[1]: ConditionPathExists=/system-update/.fedup-system-upgrade succeeded for fedup-system-up
Jul 20 21:54:55 fedora22 systemd[1]: About to execute: /usr/bin/dnf --releasever=${RELEASEVER} fedup upgrade
Jul 20 21:54:55 fedora22 systemd[1]: Forked /usr/bin/dnf as 655
Jul 20 21:54:55 fedora22 systemd[1]: fedup-system-upgrade.service changed dead -> start
Jul 20 21:54:55 fedora22 systemd[1]: Starting System Upgrade...
Jul 20 21:54:55 fedora22 systemd[655]: Executing: /usr/bin/dnf --releasever=rawhide fedup upgrade
Jul 20 21:54:55 fedora22 systemd[1]: About to execute: /usr/libexec/pk-offline-update
Jul 20 21:54:55 fedora22 systemd[1]: Forked /usr/libexec/pk-offline-update as 657
Jul 20 21:54:55 fedora22 systemd[1]: packagekit-offline-update.service changed dead -> running
Jul 20 21:54:55 fedora22 systemd[1]: Job packagekit-offline-update.service/start finished, result=done
Jul 20 21:54:55 fedora22 systemd[657]: Executing: /usr/libexec/pk-offline-update
Jul 20 21:54:55 fedora22 systemd[1]: Started Updates the operating system whilst offline.
Jul 20 21:54:55 fedora22 systemd[1]: Starting Updates the operating system whilst offline...
fedup-system-upgrade.service uses an additional flag file which is
checked with ConditionPathExists so it will not run if 'dnf fedup reboot'
did not create the flag, even if we go into system-update.target.
packagekit-offline-update.service does not have anything like this, and
is always run in system-update.target.
Running two upgrade mechanisms in parallel does not seem like a good
idea. (Even if they use a lock file to prevent concurrent access to
the rpm database, they are bound to interfere with one another: the
first finishes and decides to reboot, or the first updates some
packages and messes up the state for the second one...) It seems that
*some* mechanism to run only one upgrade mechanism is wanted. The approach
that dnf-plugin-fedup uses seems reasonable: it creates the file only when
a reboot with 'dnf fedup reboot' is requested.
As an alternative we could allow only one upgrade mechanism to be enabled.
Dunno.
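A rough sketch of what the flag-file approach could look like for
packagekit-offline-update.service, modelled on the ConditionPathExists=
line visible in the fedup log above. The drop-in path and the flag file
name are invented for illustration; PackageKit does not create such a
file as far as this mail shows.

```ini
# Hypothetical drop-in, e.g.
#   /etc/systemd/system/packagekit-offline-update.service.d/flag.conf
# The flag file name is an assumption made up for this example.
[Unit]
# Skip this unit unless its own flag file was created before the reboot.
ConditionPathExists=/system-update/.example-packagekit-flag
```

With one such condition per update service, entering system-update.target
would start only the mechanism whose flag was set before the reboot.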
... continuing ...
Jul 20 21:55:00 fedora22 pk-offline-update[657]: percentage 14%
Jul 20 21:55:00 fedora22 pk-offline-update[657]: sent msg to plymouth 'Installing Updates - 14%'
Jul 20 21:55:00 fedora22 dnf[655]: babl x86_64 0.1.12-3.fc23 @commandline 235 k
Jul 20 21:55:00 fedora22 dnf[655]: baekmuk-bdf-fonts noarch 2.2-17.fc23 @commandline 6.9 M
Jul 20 21:55:00 fedora22 dnf[655]: baekmuk-ttf-batang-fonts noarch 2.2-39.fc23 @commandline 3.6 M
...
Jul 20 21:55:00 fedora22 pk-offline-update[657]: status download
Jul 20 21:55:00 fedora22 pk-offline-update[657]: package downloading gstreamer1-1.4.5-1.fc22.x86_64 (fedora)
Jul 20 21:55:00 fedora22 pk-offline-update[657]: status finished
Jul 20 21:55:00 fedora22 pk-offline-update[657]: writing failed results
Jul 20 21:55:00 fedora22 pk-offline-update[657]: failed to update system: cannot download Packages/g/gstreamer1-1.4.5-1.fc2
...
Jul 20 21:55:16 fedora22 systemd[1]: Trying to enqueue job reboot.target/start/replace
Jul 20 21:55:16 fedora22 systemd[1]: Job system-update.target/start finished, result=canceled
Jul 20 21:55:16 fedora22 systemd[1]: Installed new job system-update.target/stop as 762
...
Jul 20 21:55:16 fedora22 systemd[1]: Spawning new thread for sync
Jul 20 21:55:16 fedora22 systemd[1]: Installed new job time-sync.target/stop as 736
Jul 20 21:55:16 fedora22 systemd[1]: Installed new job lvm2-lvmetad.service/stop as 753
Jul 20 21:55:16 fedora22 systemd[1]: Job fedup-system-upgrade.service/start finished, result=canceled
Jul 20 21:55:16 fedora22 systemd[1]: Installed new job fedup-system-upgrade.service/stop as 769
Jul 20 21:55:16 fedora22 systemd[1]: Enqueued job reboot.target/start as 658
Jul 20 21:55:16 fedora22 systemd[1]: packagekit-offline-update.service failed.
...
Jul 20 21:55:11 fedora22 systemd[1]: packagekit-offline-update.service: main process exited, code=exited, status=1/FAILURE
Jul 20 21:55:11 fedora22 systemd[1]: packagekit-offline-update.service changed running -> failed
Jul 20 21:55:11 fedora22 systemd[1]: Unit packagekit-offline-update.service entered failed state.
Jul 20 21:55:11 fedora22 systemd[1]: Triggering OnFailure= dependencies of packagekit-offline-update.service.
Jul 20 21:55:16 fedora22 systemd[1]: Job system-update.target/stop finished, result=done
Jul 20 21:55:16 fedora22 systemd[1]: fedup-system-upgrade.service changed start -> stop-sigterm
...
Jul 20 21:55:29 fedora22 systemd-journal[514]: Suppressed 978 messages from /system.slice/fedup-system-upgrade.service
Jul 20 21:55:41 fedora22 dnf[655]: Upgrading : glibc-common-2.21.90-18.fc24.x86_64 29/3693
Jul 20 21:55:41 fedora22 systemd[1]: Serializing state to /run/systemd
Jul 20 21:55:41 fedora22 systemd[1]: Reexecuting.
...
now systemd reexecutes multiple times while dnf is updating packages
...
then things seem to go wrong
...
Jul 20 21:59:20 fedora22 systemd[1]: Looping too fast. Throttling execution a little.
Jul 20 21:59:22 fedora22 systemd[1]: Looping too fast. Throttling execution a little.
Jul 20 21:59:23 fedora22 systemd[1]: fedup-system-upgrade.service stop-sigterm timed out. Killing.
Jul 20 21:59:23 fedora22 systemd[1]: fedup-system-upgrade.service changed stop-sigterm -> stop-sigkill
Jul 20 21:59:23 fedora22 dnf[655]: Upgrading : pam-1.2.1-1.fc23.x86_64 263/3693
Jul 20 21:59:23 fedora22 systemd[1]: Child 655 (dnf) died (code=killed, status=9/KILL)
Jul 20 21:59:23 fedora22 systemd[1]: Child 655 belongs to fedup-system-upgrade.service
Jul 20 21:59:23 fedora22 systemd[1]: fedup-system-upgrade.service: main process exited, code=killed, status=9/KILL
Jul 20 21:59:23 fedora22 systemd[1]: fedup-system-upgrade.service changed stop-sigkill -> failed
Jul 20 21:59:23 fedora22 systemd[1]: Job fedup-system-upgrade.service/stop finished, result=done
Jul 20 21:59:23 fedora22 systemd[1]: Stopped System Upgrade.
Jul 20 21:59:23 fedora22 systemd[1]: Unit fedup-system-upgrade.service entered failed state.
Jul 20 21:59:23 fedora22 systemd[1]: fedup-system-upgrade.service failed.
Jul 20 21:59:23 fedora22 systemd[1]: Rebooting as result of failure.
...
reboot seems to proceed normally.
Based on the sequence of operations here, it seems that pk-offline-update
schedules a reboot on its own when it is unable to complete the
download, but it also has OnFailure=reboot.target so the reboot is
started a second time when pk-offline-update exits (I should check the
code, but I'm too lazy for that atm :)).
Also, a minor but related point: OnFailure=reboot.target
seems inferior to FailureAction=reboot. IIRC, the second one uses
an irreversible transaction and should be more robust. It also is a
higher level setting in some sense. OnFailure=reboot.target is taken
directly from the spec, so should be changed there first.
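For illustration, the two variants side by side as unit-file fragments.
This is only a sketch: in systemd of this era FailureAction= is a
[Service] setting (later releases also accept it in [Unit]), and the
rest of the unit is left out.

```ini
# Variant from the spec: pull in reboot.target as an ordinary dependency
# when the unit fails.
[Unit]
OnFailure=reboot.target

# Proposed alternative: let systemd itself trigger the reboot action
# when the service fails.
[Service]
FailureAction=reboot
```

A unit would use one or the other, not both.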
Also, another related issue: packagekit-offline-update.service has
Type=simple. (In the log above it is "started" almost immediately, so
system-update.target could be reached while it is still running.) This
should be Type=oneshot.
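A minimal sketch of the change, using the ExecStart= path that appears
in the log above and omitting everything else in the unit:

```ini
[Service]
# With Type=oneshot the unit counts as started only after the ExecStart=
# process has exited, so units ordered After= it wait for the update.
Type=oneshot
ExecStart=/usr/libexec/pk-offline-update
```

With Type=simple, by contrast, the start job finishes as soon as the
process has been forked, which matches the "result=done" line logged in
the same second as the fork.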
It seems that failure handling is already shaky, but I think there are more
failure modes. Let's say that 'dnf fedup upgrade' didn't work for some
reason (missing ConditionPathExists file, dnf installation problem, whatever).
Then nothing would remove the /system-update link, and we would reboot,
and run system-update.target again, and reboot, and run system-update.target.
In general, creating /system-update without a working update service
is enough to enter an infinite reboot loop.
The spec says that the /system-update symlink should be removed by the
service as early as possible, but it would be more robust to remove
it even earlier. ExecStartPre=/bin/rm /system-update would be one
option, but it is incompatible with Condition*s, because the service
would then always have to run. I don't think it can be removed by the
generator, because the fs might still be ro when it runs (?). So maybe
a tmpfiles snippet should be used to remove the link. Such a change
would mean that the update services should not depend on the symlink
being present, and should instead look for their installation data in
their own state directory.
To summarize, the following changes to the spec are proposed:
- use Condition* or similar to conditionalize whether a specific
upgrade mechanism should run
- use FailureAction=reboot
- use Type=oneshot
- check that logind.Reboot() is not called on failure by the service
- services should not look for the /system-update symlink,
and the symlink should be removed by tmpfiles before we even get to
the upgrade.
Zbyszek