[systemd-devel] sometimes systemd takes a very long time to restart service

Thu Jan 3 12:41:57 PST 2013

Zbigniew Jędrzejewski-Szmek wrote:
> On Wed, Jan 02, 2013 at 02:28:19AM -0700, JB wrote:
>   
>> Andrey Borzenkov wrote:
>>     
>>> В Tue, 01 Jan 2013 23:37:56 -0700
>>> JB <general at itpsg.com> пишет:
>>>       
>>>> Andrey Borzenkov wrote:
>>>>         
>>>>> В Tue, 01 Jan 2013 18:52:38 -0700
>>>>> JB <general at itpsg.com> пишет:
>>>>>           
>>>>>> Thanks!  I'll try and it may work in my case.  What's
>>>>>> interesting is that in your case it sounded like rsyslog was
>>>>>> hanging around while it was having problems dealing with the
>>>>>> condition of having the network unavailable.  In my case,
>>>>>> webrickd actually stops and shuts down almost immediately
>>>>>> but for some reason systemd doesn't or can't figure that
>>>>>> out.
>>>>>>             
>>>>> Showing "systemctl status webrickd.service" before restarting and
>>>>> during restart may give some hints.
>>>>>           
>>>> Good idea.
>>>>
>>>> It goes from active/running to this immediately after restart...
>>>>
>>>> webrickd.service - Configuration ruby webrick daemon
>>>>          Loaded: loaded (/etc/systemd/system/webrickd.service)
>>>>          Active: deactivating (final-sigterm) since Tue, 01 Jan
>>>> 2013 22:44:39 -0700; 1min 14s ago
>>>>         Process: 15771
>>>> ExecStart=/home/rtuser/app/bin/webrickd.rb -d -p
>>>> /home/rtuser/app/data/logs/webrickd.pid (code=exited,
>>>> status=0/SUCCESS)
>>>>
>>>> Then this, anywhere from 1 to 3 minutes later...
>>>>
>>>> webrickd.service - Configuration ruby webrick daemon
>>>>          Loaded: loaded (/etc/systemd/system/webrickd.service)
>>>>          Active: active (running) since Tue, 01 Jan 2013
>>>> 22:47:40 -0700; 15s ago
>>>>         Process: 15807
>>>> ExecStart=/home/rtuser/app/bin/webrickd.rb -d -p
>>>> /home/rtuser/app/data/logs/webrickd.pid (code=exited,
>>>> status=0/SUCCESS)
>>>>         
>>> And initial state (i.e. after it is started)?
>>>
>>>       
>>>> Any ideas why it sits in state "deactivating (final-sigterm)"
>>>> for 1 to 3 minutes
>>>>         
>> It gets better.  I took a very stripped simple daemon to try to
>> understand the interaction between systemd and the process.  Here it
>> is:
>>     
> Hi,
> your process is forking as it should (as least on my machine).
> Putting systemd in debug mode 
>     sudo kill -SIGRTMIN+22 1
> and running
>     sudo systemctl start webrickd.service
> gives me:
> Jan 02 08:55:27 spora sudo[20648]: zbyszek : TTY=pts/0 ; PWD=/home/zbyszek/src/systemd ; USER=root ; COMMAND=/bin/systemctl start webrickd.service
> Jan 02 08:55:27 spora systemd[1]: Accepted connection on private bus.
> Jan 02 08:55:27 spora systemd[1]: Running GC...
> Jan 02 08:55:27 spora systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Manager.StartUnit() on /org/freedesktop/systemd1
> Jan 02 08:55:27 spora systemd[1]: Trying to enqueue job webrickd.service/start/replace
> Jan 02 08:55:27 spora systemd[1]: Installed new job webrickd.service/start as 2851
> Jan 02 08:55:27 spora systemd[1]: Enqueued job webrickd.service/start as 2851
> Jan 02 08:55:27 spora systemd[1]: About to execute: /usr/bin/ruby /home/zbyszek/src/systemd/test.rb
> Jan 02 08:55:27 spora systemd[1]: Forked /usr/bin/ruby as 20651
> Jan 02 08:55:27 spora systemd[1]: webrickd.service changed failed -> start
> Jan 02 08:55:27 spora systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Manager.GetUnit() on /org/freedesktop/systemd1
> Jan 02 08:55:27 spora systemd[1]: Got D-Bus request: org.freedesktop.DBus.Properties.Get() on /org/freedesktop/systemd1/unit/webrickd_2eservice
> Jan 02 08:55:27 spora systemd[1]: Received SIGCHLD from PID 20651 (ruby).
> Jan 02 08:55:27 spora systemd[1]: Got SIGCHLD for process 20651 (ruby)
> Jan 02 08:55:27 spora systemd[1]: Child 20651 died (code=exited, status=0/SUCCESS)
> Jan 02 08:55:27 spora systemd[1]: Child 20651 belongs to webrickd.service
> Jan 02 08:55:27 spora systemd[1]: webrickd.service: control process exited, code=exited status=0
> Jan 02 08:55:27 spora systemd[1]: webrickd.service got final SIGCHLD for state start
> Jan 02 08:55:27 spora systemd[1]: Main PID guessed: 20653
> Jan 02 08:55:27 spora systemd[1]: webrickd.service changed start -> running
> Jan 02 08:55:27 spora systemd[1]: Job webrickd.service/start finished, result=done
> Jan 02 08:55:27 spora systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local
> Jan 02 08:55:27 spora systemd[1]: Received SIGCHLD from PID 20653 (ruby).
> Jan 02 08:55:27 spora systemd[1]: Got SIGCHLD for process 20653 (ruby)
> Jan 02 08:55:27 spora systemd[1]: Child 20653 died (code=exited, status=0/SUCCESS)
> Jan 02 08:55:27 spora systemd[1]: Child 20653 belongs to webrickd.service
> Jan 02 08:55:27 spora systemd[1]: webrickd.service: main process exited, code=exited, status=0
> Jan 02 08:55:27 spora systemd[1]: webrickd.service changed running -> stop-sigterm
> Jan 02 08:55:27 spora systemd[1]: Received SIGCHLD from PID 20656 (ruby).
> Jan 02 08:55:27 spora systemd[1]: Got SIGCHLD for process 20656 (ruby)
> Jan 02 08:55:27 spora systemd[1]: Child 20656 died (code=exited, status=1/FAILURE)
> Jan 02 08:55:27 spora systemd[1]: Accepted connection on private bus.
> Jan 02 08:55:27 spora systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Agent.Released() on /org/freedesktop/systemd1/agent
> Jan 02 08:55:27 spora systemd[1]: webrickd.service: cgroup is empty
> Jan 02 08:55:27 spora systemd[1]: webrickd.service changed stop-sigterm -> dead
> Jan 02 08:55:27 spora systemd[1]: Got D-Bus request: org.freedesktop.DBus.Local.Disconnected() on /org/freedesktop/DBus/Local
> Jan 02 09:01:01 spora systemd[1]: Got D-Bus request: org.freedesktop.DBus.NameOwnerChanged() on /org/freedesktop/DBus
> Jan 02 09:01:01 spora systemd[1]: Running GC...
> Jan 02 09:01:01 spora systemd[1]: Collecting webrickd.service
>
> With normal UNIX double forking, the process should fork twice, and
> the first two should processes should exit. This is what happens,
> but in the wrong order. The first process is only supposed to exit
> after the main process is alive and ready to serve requests. As you
> can see from the log above, the first process exits while the second
> process is still running, and thus systemd thinks that this is the main
> process. Subsequenty it exits, and the service is considered dead.
>   
Ahh, that makes tremendous sense.  Thank you so much for taking a look 
at that.  I'm going to have to read up on how to put systemd in debug 
mode.  I've written daemon's before in C where I have complete control 
over this.  I wonder if the forking implementation Ruby's daemonize has 
this backwards.  The more I learn about Ruby the more frustrated I 
become with it.  I've worked around this problem by using the "Simple" 
type and not daemonizing the process.
> OTOH, your original problem (with restart taking >1 minute) is
> something different. What systemd version exactly are you using? Can
> you enable debugging and then post the output from the long 'systemctl
> restart webrick.service'?
>   
I'm using whatever comes with FC15.  According to the log it's version 
26 (RPM is 26-18)
I'm currently in crisis with this app and have already stripped the 
daemon code out and rewritten accordingly.  When I have this released, 
I'll go ahead and run the test you suggested but I think they might be 
related.