[systemd-devel] "Inactive/dead" services that are enabled are indistinguishable from unused or oneshot services

Thu Mar 17 23:08:01 PDT 2011

On Thu, 17 Mar 2011 22:48:36 +0100
Lennart Poettering <lennart at poettering.net> wrote:

> On Thu, 17.03.11 10:20, Mike Kazantsev (mk.fraggod at gmail.com) wrote:
> 
> > On Thu, 17 Mar 2011 01:39:19 +0100
> > Lennart Poettering <lennart at poettering.net> wrote:
> > 
> > Experiencing several reboots on a machine with 50+ enabled daemons
> > I've noticed that some of them (mostly the ones started via some
> > "launcher script" like apachectl, pg_ctl, ejabberdctl, etc.) tend to
> > "cleanly" fail randomly on start just because GuessMainPID= mechanism
> > fails and systemd actually kills the service.
> 
> Hmm, GuessMainPID= fails? Do you know why exactly? Ideas for improvements?
> 
> The current logic is pretty simple: we look for all processes in the
> service cgroup which have PPID == 1. If there is only one of these, we
> assume it is the main process. In your case there hence must be more
> than one where this condition applies? Any recommendation what else we
> could check?
> 

For some services I've observed the following behavior:
 * logs state that the service received SIGTERM and is shutting down.
 * systemd status shows "Main PID" that differs from the one in the
   logs and/or pidfile.

Thus I assume that in all these cases the launcher forks more than one
process and when the first forked one (which gets marked as "main")
dies, systemd pulls the plug and just kills the rest of them.
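
To make the suspected race concrete, here is a minimal Python sketch of
the heuristic as Lennart describes it above (exactly one process in the
service cgroup with PPID == 1). It is a toy model, not systemd's actual
code: guess_main_pid() and the (pid, ppid) snapshots are made up purely
for illustration.

```python
from typing import Iterable, Optional, Tuple


def guess_main_pid(cgroup_procs: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Return the only PID in the cgroup whose parent is PID 1.

    Returns None when zero or several processes qualify, i.e. when
    the guess would be ambiguous.
    """
    candidates = [pid for pid, ppid in cgroup_procs if ppid == 1]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical (pid, ppid) snapshots of one service cgroup.
# Early: the launcher has exited, its short-lived helper 1201 is already
# reparented to PID 1, and the real daemon 1202 is still the helper's child.
early = [(1201, 1), (1202, 1201)]
# Later: the helper is gone and the daemon has been reparented to PID 1.
late = [(1202, 1), (1203, 1202)]
# Two processes reparented at the same time: no unique candidate.
ambiguous = [(1201, 1), (1202, 1)]

print(guess_main_pid(early))      # 1201
print(guess_main_pid(late))       # 1202
print(guess_main_pid(ambiguous))  # None
```

If the guess happens to be taken from the early snapshot, the helper is
the one marked as "main", and its exit then looks like the whole service
going down.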

The problem seems to be related to timing, and maybe some switch like
GuessMainPIDAfterRunningForSec= would help, but it'd still be racy.
Disabling PID guessing and using PIDFile= seems to be a better way to
do it, with the app's cooperation. All the apps with such a complicated
start seem to support pidfiles, so I don't think anything else is
necessary there, unless pidfile eradication becomes some kind of
crusade, but then all such "launchers" should probably just go away as
well.
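
The reason a pidfile avoids the guessing: the daemon itself states which
PID is the main one, so nothing has to be inferred from the process
tree. A minimal sketch of the reading side, assuming the common
convention of a single decimal PID in the file. read_pidfile() and the
file names are hypothetical, and this is not how systemd implements
PIDFile=.

```python
import os
import tempfile
from typing import Optional


def read_pidfile(path: str) -> Optional[int]:
    """Return the PID stored in a pidfile, or None if missing or invalid."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None
    return pid if pid > 0 else None


# Demo with a throwaway file standing in for the daemon's pidfile.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "daemon.pid")
    with open(demo_path, "w") as f:
        f.write("1202\n")
    demo_pid = read_pidfile(demo_path)
    missing_pid = read_pidfile(os.path.join(tmp, "absent.pid"))

print(demo_pid)     # 1202
print(missing_pid)  # None
```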

> > I understand that there's a limited number of reasons for such "clean
> > stop" (manual interaction, units like rsyslog.service, Conflicts=,
> > isolate, etc), but still it's a wrong way to approach the particular
> > problem.
> > 
> > I've solved the problem for myself by writing a simple dbus-python
> > script (http://goo.gl/V6e7V). It shows exactly everything that's
> > enabled and not active (with "oneshot" exception), not some random
> > subset of this.
> 
> Hmm, jupp. I agree, this is very useful. I added this to the todo list now.
> 

Thanks!

> > Unfortunately, new rsyslog.service (and services using "systemctl stop"
> > directly) can affect such display, which I think shows the flawed
> > assumption that "enabled" in systemd means "should be active,
> > period" (with the exception of "oneshot" units) on my part, and I don't
> > know of an easy solution to this, short of adding another enabled-like state.
> 
> Hmm, yeah. This problem is hard. But I think simply showing "enabled but
> not running" is already quite useful, even if a service on that list is
> not necessarily buggy, but just not hooked in by anything.
> 

I think Andrey's "systemctl --query" suggestion in this thread or a
special "systemd-query" tool should already be able to provide such
functionality (and more), so it should be a good enough solution.

Combined with a "failed" state for dead-services-that-shouldn't-be it
should be even better - services stopped via systemctl like that won't
have "failed" state, so they can be easily filtered out by the same
query tool or grep/awk.
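
As a rough illustration of that filtering, here is a small Python sketch
over plain records. The field names (name, enabled, type, active_state),
the sample units, and the helper enabled_but_inactive() are all
assumptions made for the example; they are not the output format of any
existing tool, and the "failed" state is the one proposed in this
thread.

```python
from typing import Any, Dict, List


def enabled_but_inactive(units: List[Dict[str, Any]],
                         only_failed: bool = False) -> List[str]:
    """Names of enabled, non-oneshot units that are not active.

    With only_failed=True, units that were stopped cleanly
    (state "inactive") are dropped and only "failed" ones remain.
    """
    result = []
    for unit in units:
        if not unit["enabled"] or unit["type"] == "oneshot":
            continue
        if unit["active_state"] == "active":
            continue
        if only_failed and unit["active_state"] != "failed":
            continue
        result.append(unit["name"])
    return result


# Hypothetical sample data.
units = [
    {"name": "apache.service", "enabled": True,
     "type": "forking", "active_state": "failed"},
    {"name": "rsyslog.service", "enabled": True,
     "type": "simple", "active_state": "inactive"},
    {"name": "fsck.service", "enabled": True,
     "type": "oneshot", "active_state": "inactive"},
    {"name": "sshd.service", "enabled": True,
     "type": "simple", "active_state": "active"},
    {"name": "unused.service", "enabled": False,
     "type": "simple", "active_state": "inactive"},
]

print(enabled_but_inactive(units))
# ['apache.service', 'rsyslog.service']
print(enabled_but_inactive(units, only_failed=True))
# ['apache.service']
```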

-- 
Mike Kazantsev // fraggod.net