[systemd-devel] Allow stop jobs to be killed during shutdown

Sun Jan 26 09:16:13 PST 2014

On Sun, 26 Jan 2014 17:23:54 +0100
Tom Gundersen <teg at jklm.no> wrote:

> 
> >> Unfortunately, setting KillMode=process is not allowed:
> >>
> >> Jan 26 17:12:30 linux-1a7f systemd[1]: user at 0.service has PAM enabled. Kill mode must be set to 'control-group'. Refusing.
> >>
> >> Probably user at .service should be exempt from this rule. It is supposed
> >> to handle all the services it starts itself; it *is* a service manager,
> >> after all.
> 
> I don't think we want any processes to survive the exit of
> user at .service, so KillMode=process feels wrong. However, isn't the
> problem that we are going into the "kill control-group" mode too soon,
> before user at .service has had a chance to clean itself up
> gracefully?
> 

Yes.

> > I rebuilt systemd without this restriction, set KillMode=process for
> > user at .service and this fixed things here.
> >
> > So there are two problems associated with the user instance.
> >
> > 1. Using KillMode=control-group is wrong. Each service managed by the
> > user instance has its own requirements for how it is stopped. Just
> > sending SIGTERM to everything without even trying the service's
> > ExecStop first is obviously incorrect.
> 
> I guess what we want is to first send SIGTERM only to the systemd
> --user process, and only after a timeout start sending SIGTERM to all
> the processes in the control group? I.e., wouldn't an ExecStop entry in
> user at .service give us the required timeout?
> 

That does not work: systemd sends SIGTERM as soon as the ExecStop
command has finished.

Jan 26 21:00:14 linux-1a7f systemd[1]: Stopping User Manager for 0...
Jan 26 21:00:14 linux-1a7f systemd[1]: About to execute: /usr/bin/kill -15 $MAINPID
Jan 26 21:00:14 linux-1a7f systemd[1]: Forked /usr/bin/kill as 1978
Jan 26 21:00:14 linux-1a7f systemd[1]: user at 0.service changed running -> stop
Jan 26 21:00:14 linux-1a7f systemd[1978]: Executing: /usr/bin/kill -15 1886
Jan 26 21:00:14 linux-1a7f systemd[1886]: Received SIGTERM from PID 1978 (kill).
Jan 26 21:00:14 linux-1a7f systemd[1886]: Activating special unit exit.target
Jan 26 21:00:14 linux-1a7f systemd[1886]: Trying to enqueue job exit.target/start/replace
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job exit.target/start as 9
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job systemd-exit.service/start as 10
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job shutdown.target/start as 11
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job -.slice/stop as 12
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job default.target/stop as 13
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job test.service/stop as 14
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job paths.target/stop as 15
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job timers.target/stop as 16
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job sockets.target/stop as 17
Jan 26 21:00:14 linux-1a7f systemd[1886]: Enqueued job exit.target/start as 9
Jan 26 21:00:14 linux-1a7f systemd[1886]: Stopping Test service with stop delay...
Jan 26 21:00:14 linux-1a7f systemd[1886]: About to execute: /bin/sleep 10
Jan 26 21:00:14 linux-1a7f systemd[1886]: Forked /bin/sleep as 2001
Jan 26 21:00:14 linux-1a7f systemd[1886]: test.service changed exited -> stop
Jan 26 21:00:14 linux-1a7f systemd[2001]: Executing: /bin/sleep 10
Jan 26 21:00:14 linux-1a7f systemd[1886]: Stopping Default.
...
Jan 26 21:00:14 linux-1a7f systemd[1]: Child 1978 died (code=exited, status=0/SUCCESS)
Jan 26 21:00:14 linux-1a7f systemd[1]: Child 1978 belongs to user at 0.service
Jan 26 21:00:14 linux-1a7f systemd[1]: user at 0.service: control process exited, code=exited status=0
Jan 26 21:00:14 linux-1a7f systemd[1]: user at 0.service got final SIGCHLD for state stop
Jan 26 21:00:14 linux-1a7f systemd[1]: user at 0.service changed stop -> stop-sigterm

I believe someone has already mentioned this problem. In general, we
cannot assume that ExecStop is synchronous. It may just signal the main
process to exit and return immediately. systemd should wait until
$MAINPID exits (or a timeout expires) before continuing with further
processing.
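
To illustrate the behaviour I would expect, here is a minimal Python
sketch. It is not systemd code; stop_and_wait(), spawn() and the child
script are hypothetical names made up for this example. The stop step
only signals the main process, and the caller then waits for that
process to exit, escalating to SIGKILL only once the timeout expires.

```python
import signal
import subprocess
import sys


def stop_and_wait(proc, timeout):
    """Signal the main process, then wait for it to exit.

    Returns "exited" if the process went away within `timeout` seconds,
    or "killed" if the deadline passed and SIGKILL had to be sent.
    """
    proc.send_signal(signal.SIGTERM)  # asynchronous "ExecStop": returns at once
    try:
        proc.wait(timeout=timeout)    # wait for the main PID, not the stop command
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()                   # escalate only after the timeout
        proc.wait()
        return "killed"


# Stand-in for a manager that needs about one second of cleanup after SIGTERM.
CHILD = r"""
import signal, sys, time

def on_term(signum, frame):
    time.sleep(1)   # simulated graceful cleanup of its own services
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def spawn():
    """Start the stand-in child and wait until its SIGTERM handler is installed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # blocks until the child has printed "ready"
    return proc
```

With a generous timeout the child finishes its cleanup and exits on its
own; with a timeout shorter than the cleanup it gets killed, which is
roughly what happens to the user instance in the log above.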

> > 2. user at .service has a single timeout, but it manages a number of
> > services that is unknown in advance, each needing an unknown timeout.
> > While capping the total timeout looks sensible, only the user can
> > estimate the value. But user at .service is a system-level service
> > which the user cannot configure ...
> 
> I think it really makes sense to have a system-wide timeout on these
> things (possibly a high one), we don't want the user to delay things
> without limit. The user already has the possibility of putting their
> own limits if they want to (but they must of course be shorter than
> the system-wide one).
> 

I mostly agree, except that the current 90 seconds looks too short, and
this definitely requires better communication to the user (such as
automatically leaving quiet mode) so that the system does not appear to
be hung.
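
As a rough illustration of why a single 90-second cap can be too
short, here is a small Python sketch. The function names and the model
are my own simplification, not systemd's logic: I assume services that
are ordered against each other stop one after another, and that
independent chains stop in parallel.

```python
SYSTEM_STOP_TIMEOUT = 90.0  # seconds; the current limit discussed above


def effective_timeout(user_timeout, system_cap=SYSTEM_STOP_TIMEOUT):
    """Stop timeout that actually applies to one user service.

    The user's own limit only counts if it is shorter than the
    system-wide cap; None means the user set no limit.
    """
    if user_timeout is None:
        return system_cap
    return min(user_timeout, system_cap)


def worst_case_stop_time(chains):
    """Worst-case time for the whole user instance to stop.

    `chains` is a list of lists. Each inner list holds the stop
    timeouts of services that must stop sequentially; separate
    chains are assumed to stop in parallel.
    """
    return max((sum(chain) for chain in chains), default=0.0)


def fits_in_cap(chains, system_cap=SYSTEM_STOP_TIMEOUT):
    """True if every chain can finish before the system-wide cap."""
    return worst_case_stop_time(chains) <= system_cap
```

For example, fits_in_cap([[10], [40, 60]]) is False: the second chain
alone may need 100 seconds, so the cap would cut it short even though
no single service asked for more than 60.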

There is also a practical issue: we have two levels, the PID 1 instance
and the user instance (actually multiple users). Does it make sense to
display each individual user service as it shuts down? This would
facilitate troubleshooting. But then we have interleaved output from
multiple (independent) instances ...