[systemd-devel] Revisiting the "ExecRestart" issue

Fri Mar 28 10:12:27 PDT 2014

Hi all,
   I've brought this up before, but I became busy/discouraged and dropped
the ball.  As systemd becomes increasingly widely deployed, I can no longer
afford to do so, so I'd like to explore this area a bit further on the list
again and see if we can't come up with a workable solution, or if perhaps
I've missed some systemd/cgroups change in the past year or so that already
allows a workaround.

  To recap the previous discussion, see the threads at these links (same
thread, two different months in the thread-list):
http://lists.freedesktop.org/archives/systemd-devel/2012-November/007595.html
http://lists.freedesktop.org/archives/systemd-devel/2012-December/007804.html
  As well as this referenced/related thread from even earlier (different
author, but I suspect his issues are similar at the core of things):
http://lists.freedesktop.org/archives/systemd-devel/2012-June/005400.html

  The daemon I'm working on is the DNS server gdnsd (
https://github.com/~blblack/gdnsd ).  I'll try to keep this short (fat
chance!), but these are the core things that matter about it from a
systemd perspective, and how they seem to paint me into a corner:

0) It's meant to be somewhat portable outside of systemd and Linux, at
least to the *BSDs.  While I'm completely open to doing some small
(runtime-|autoconf-)conditional blocks of systemd-specific code in place of
traditional daemon code where it makes sense, I can't go and rewrite
everything in a new structure that only makes sense under systemd.

1) The daemon is designed to work as its own initscript.  Not unique, but
certainly less common.  It ships a daemon binary which accepts
initscript actions on the commandline.  So, "/usr/sbin/gdnsd start" forks
off a daemon, "/usr/sbin/gdnsd stop" kills the existing daemon, and
"/usr/sbin/gdnsd status" and all the other common initscript verbs work
the same way.  The internal code already handles race-free stops and
starts, pidfile locking, reliable "status", proper daemonization,
privilege drop, etc. through all of this.  Most traditional sysvinit-like
systems will of course use a real shell initscript at runtime, and that
initscript can just invoke these verbs, perhaps redirecting their verbose
output to /dev/null.  It can rely on pidfiles and processes already being
well-managed, so there's no need to write clunky/racy shell code to try
to solve those problems.

2) During startup of a fresh daemon, a number of operations have to happen
in a serial fashion due to hard dependency constraints, and for some users
these startup operations can take significant wallclock time relative to
desired service availability.  These operations include things like loading
zonefiles (which can be expensive for large files or large counts of files,
both real-world use-cases today) and doing initial network-monitoring
polls of remote resources to set their initial state (which involve
timeouts for network responses - these are done in parallel to the degree
possible, but this can still add several seconds for reasonable
monitor-counts with reasonable timeouts).  All of these things must
complete before the new daemon can begin answering requests legitimately on
its listening sockets.

3) As you can imagine, this creates a problem for the traditional "restart"
verb: If one stops and then starts, there can be a long gap of service
unavailability.  To remedy this, I moved in the direction of having the
internal "restart" verb work in an overlapped fashion.  The way "restart"
is implemented basically follows this logic:
   a) restart is just a special case of "start"
   b) it parses configuration and does all the potentially-long operations
of a normal start first
   c) if anything fails (due to a new configuration error, etc), it dies
and leaves the old daemon instance alone.
   d) when it successfully reaches the point where it and the existing
daemon can no longer co-exist (because it needs to steal the bound
sockets), it *then* kills the old daemon using the "stop" logic, locks the
pidfile for itself, binds the sockets, and continues on as the new daemon.
   e) (and actually, in the upcoming next branch, SO_REUSEPORT will be used
to overlap the sockets as well, allowing for truly zero-packets-lost during
these restart operations).
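
  As a rough illustration of (b) through (e), here is a minimal Python
sketch.  gdnsd itself is written in C; the function names and the four
callbacks below are hypothetical and are not taken from the gdnsd source.
It only shows the ordering (all expensive work happens before the old
daemon is touched) and that SO_REUSEPORT lets two instances hold the same
UDP port at once on Linux 3.9 or later.

```python
import socket


def bind_udp_reuseport(addr, port):
    """Bind a UDP socket with SO_REUSEPORT set (Linux 3.9+)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Every instance sharing the port must set the option before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((addr, port))
    return sock


def overlapped_restart(slow_startup, stop_old, take_pidfile, bind_sockets):
    """Hypothetical outline of the restart verb; all four args are callables."""
    state = slow_startup()      # (b) parse config, load zones, initial polls
    # (c) if slow_startup() raised, we never get here: old daemon untouched
    stop_old()                  # (d) point of no return: kill the old instance
    take_pidfile()              # lock the pidfile for ourselves
    return bind_sockets(state)  # bind and carry on as the new daemon
```

  With the SO_REUSEPORT variant in (e), the bind step would presumably move
ahead of stop_old(), so that both instances listen for a short while.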

4) Socket Activation! I know this is what some will scream when they skim
the above, but it's not a realistic solution in this case for a few reasons:
    a) The startup delay, in some cases, can be many whole wallclock
seconds.  This is necessary and acceptable in the general sense (this is a
network service that people use with large server-side installations, not a
desktop thing).
    b) The primary socket traffic we care about is UDP, and further we
*really* care about request->response latency for this traffic.  Even if
you could set a large enough receive buffer to handle several seconds of
heavy UDP requests (and you can't, for at least some installations), the
multi-second-delay in the responses isn't reasonable.
    c) Another side-point that might be better addressed in another thread:
even if both of the above weren't true, this daemon uses several sockets
for multiple "roles" internally, some of which share all low-level details
(e.g. two distinct use-cases for multiple TCP sockets that serve different
high-level protocols, where the user might choose arbitrary ports for
both).  I'm not seeing any trivial way to distinguish these via socket
activation - perhaps some kind of socket "label" that could be accessed by
the daemon via sd_* APIs to distinguish would be useful here?
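
  To make the "label" idea in (c) concrete, here is a minimal Python sketch
of what the daemon side might look like.  LISTEN_PID and LISTEN_FDS are the
existing socket-activation environment variables, with passed descriptors
starting at fd 3.  EXAMPLE_FD_LABELS, the label strings, and both function
names are made up purely for illustration; nothing here is an existing
systemd interface beyond those two variables.

```python
SD_LISTEN_FDS_START = 3  # first passed fd, as in sd_listen_fds(3)


def listen_fds(environ, pid):
    """Return the fds passed via socket activation, or [] if none are for us."""
    if environ.get("LISTEN_PID") != str(pid):
        return []
    count = int(environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def fds_by_label(environ, pid, label_var="EXAMPLE_FD_LABELS"):
    """Group passed fds by a HYPOTHETICAL colon-separated label list."""
    fds = listen_fds(environ, pid)
    raw = environ.get(label_var, "")
    labels = raw.split(":") if raw else []
    if len(labels) != len(fds):
        raise ValueError("label count does not match LISTEN_FDS")
    grouped = {}
    for label, fd in zip(labels, fds):
        # Several sockets may share one role, so keep a list per label.
        grouped.setdefault(label, []).append(fd)
    return grouped
```

  A daemon would call something like fds_by_label(os.environ, os.getpid())
early in startup and hand each group to the code that serves that role.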

5) ExecReexec - this was one of Lennart's musings in the previous thread in
Dec 2012.  However, this doesn't map well to gdnsd's model if implemented in
the "obvious" manner of having ExecReexec send a signal to the running
daemon to re-exec itself.  It would map well if gdnsd could respond to
SIGFOO via fork()->execve() on itself with the "restart" verb and let the
new instance replace it when it's ready.  The problem is that the new
restarting copy needs elevated privileges to bind its sockets, which the
running daemon has permanently dropped by the time it becomes a real daemon
(and thus can't provide to the newly execve'd copy).  In some cases we could
pass the sockets on by clearing FD_CLOEXEC, but there's no guarantee as to
what socket bindings the new daemon will have: typically the same as before,
but perhaps the address or port number has changed in the config file for
one of five different sockets.
  To try to infer and diff the config/states of the old and new daemon
would be a complex mess.  What "gdnsd restart" wants to do is not a
"reload" or some halfway point between reload and restart, it's a full,
complete restart that re-evaluates everything freshly.  It just wants to
use overlapping in the time dimension to reduce the downtime of that event.
  (We do have a separate reload event for when just zonefiles have changed
but the rest of the configuration has not, and even support for monitoring
those at runtime without needing an event, but that's neither here nor
there and doesn't remove the need for an overlapped restart operation on
real config changes).
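
  To make the FD_CLOEXEC idea from this point concrete, here is a minimal
Python sketch with hypothetical helper names.  It clears the close-on-exec
flag so a socket would survive execve(), and checks whether an inherited
socket is still bound where the (possibly changed) config wants it.  That
second check is the per-socket comparison that turns into a mess once
several sockets and roles are involved.

```python
import fcntl
import socket


def clear_cloexec(fd):
    """Clear FD_CLOEXEC so the descriptor stays open across execve()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags & ~fcntl.FD_CLOEXEC)


def still_wanted(sock, cfg_addr, cfg_port):
    """True if an inherited IPv4 socket is bound where the new config wants it."""
    addr, port = sock.getsockname()[:2]
    return (addr, port) == (cfg_addr, cfg_port)
```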

6) The TL;DR finale:

  What I'm really looking for here is a mechanism by which we can overlap
two daemon instances temporarily for a single service, with the new one
eventually replacing the old one.  The ideal would be that ExecRestart (or
whatever verb it ends up being) allows the restart command to fork a new
daemon that becomes the main PID for the service after killing off the
existing one and taking over the pidfile.

  I've superficially looked around, and it's possible that I can do this
already (using ExecReload for the moment...) by essentially having the new
daemon read the cgroups of the old daemon and set them on itself manually
while it's still root, although I'm not sure what exactly would happen when
the primary PID changes out from under systemd (via the pidfile being
updated at "runtime" from systemd's perspective) and the old process dies.
 I have a bad feeling this would still lead to a SIGKILL of the new process
unless there were another mechanism to notify systemd of the changed PID,
but I haven't tested yet.  Even if such a hack works, I fear the basic
manual-cgroup-copying operation would be considered an unsupported
mechanism/interface and break in a future version.
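
  For reference, the reading half of that hack is simple.  Below is a
minimal Python sketch; the function names are hypothetical.  Each line of
/proc/<pid>/cgroup has the form hierarchy-ID:controller-list:path.  The
writing half (putting the new PID into each hierarchy) is the part that may
well be unsupported, so it is deliberately left out.

```python
def parse_proc_cgroup(text):
    """Map each controller list to its cgroup path.

    Input is the content of /proc/<pid>/cgroup, one
    "hierarchy-ID:controller-list:path" entry per line.
    """
    memberships = {}
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only; the path may contain colons.
        _hier_id, controllers, path = line.split(":", 2)
        memberships[controllers] = path
    return memberships


def read_cgroups(pid):
    """Read and parse the cgroup memberships of a running process."""
    with open("/proc/%d/cgroup" % pid) as handle:
        return parse_proc_cgroup(handle.read())
```

  The new instance would call read_cgroups() on the PID from the pidfile
while it is still root, before deciding what to do with the result.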

  Given where things are at today, as best I can tell my best bet is to go
down that sort of road anyway: try to clone over the cgroup memberships
manually somehow during an ExecReload= command for this restart (even
though it really is a restart), and leave true reloads (SIGHUP to a
running daemon) to be done from outside systemd.  And if that doesn't work,
well, I don't know what to do at this stage.  I understand the reluctance
to add these sorts of mechanisms in the general case because they're ripe
for mis-use by those porting hacky sysvinit scripts and whatnot.  Perhaps
rather than a new unit-file verb, a better way to allow this is through
re-purposing ExecReload for daemons like this, and having API calls (over
dbus? or a shlib call, either way) that the new daemon instance can invoke
that do the cgroup-copying and main-pid-switching?  I'd be happy to hack on
patches for some kind of solution myself, but I don't want to go off
hacking in a direction that will never get merged.
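
  On the main-pid-switching half: sd_notify(3) already documents a MAINPID=
message, sent as a datagram to the socket named in $NOTIFY_SOCKET.  A
minimal Python sketch of that wire protocol follows (the helper name is
hypothetical).  Whether systemd would accept this from a process forked by
an ExecReload= command, and what would then happen to the cgroup, is an
assumption and is not tested here; the unit's NotifyAccess= setting would at
least have to permit the sender.

```python
import os
import socket


def notify_main_pid(pid, notify_path=None):
    """Send MAINPID=<pid> to the service manager's notification socket.

    Returns False when no notification socket is configured.
    """
    path = notify_path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(("MAINPID=%d" % pid).encode("ascii"), path)
    return True
```

  The new instance would call notify_main_pid(os.getpid()) once it has
taken over the pidfile.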

  Another option that crosses my mind is that perhaps there are existing
mechanisms (requiring some compile-time support in the code of gdnsd) for
it to become a manager of its own sub-scope of some kind where it's free to
handle these cases in the way that it wants to.  I really don't understand
how that works yet, but if there are reasonable paths forward in that
direction, I'd be willing to give that a shot as well.  I'm in the process
of updating/refactoring/improving the daemonization and restart code in
general for a new major release, so this is an ideal time to try to fix
systemd compatibility issues while I'm in there.

Thanks,
-- Brandon