[systemd-devel] systemctl stop going through timeout even though all processes have exited

Mantas Mikulėnas grawity at gmail.com
Wed Oct 11 05:19:59 UTC 2023


Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I
recall it was a bit complex for systemd to get notified when the cgroup
actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies
a helper executable that the kernel runs... I wonder if that mechanism is
broken on your system.

On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke <martin at meltin.net> wrote:

> I'm seeing "systemctl stop <service>" for several services taking a
> long time because it goes through the timeout process, even though all
> relevant processes have exited.
>
> I'll give 2 examples.  Both examples are running inside a privileged
> Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host.  The
> systemd version, reported by "systemctl --version" in the container
> is:
>
>   systemd 239 (239-74.el8_8.5)
>
> Here is ctdb.system:
>
>   [Unit]
>   Description=CTDB
>   Documentation=man:ctdbd(1) man:ctdb(7)
>   After=network-online.target time-sync.target
>   ConditionFileNotEmpty=/etc/ctdb/nodes
>
>   [Service]
>   Type=forking
>   LimitCORE=infinity
>   LimitNOFILE=1048576
>   TasksMax=4096
>   PIDFile=/var/run/ctdb/ctdbd.pid
>   ExecStart=/usr/sbin/ctdbd
>   ExecStop=/usr/bin/ctdb shutdown
>   KillMode=control-group
>   Restart=no
>
>   [Install]
>   WantedBy=multi-user.target
>
> "/usr/bin/ctdb shutdown" causes a controlled shutdown.  In many cases,
> starting and then stopping using systemctl works fine.  However, many
> times it takes >90s to stop, as per TimeoutStopSec.  If I reduce that
> value then the duration reduces accordingly.  I can confirm using both
> "ps auxfww" and "systemd-cgls" that within the container there are no
> relevant processes a moment after "systemctl stop ctdb" is run.  In
> particular, in systemd-cgls ctdb.service is gone but "systemctl stop
> ctdb" is still waiting.
>
> Before attempting to stop, the service is successfully started:
>
>   Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:
> /usr/sbin/ctdbd
>   Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location
> file:/var/log/log.ctdb
>   Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741
> (ctdbd).
>   Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to
> ctdb.service.
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited,
> code=exited status=0
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for
> state start.
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742
> belongs to service, we are happy.
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running
>   Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start
> finished, result=done
>   Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.
>   -- Subject: Unit ctdb.service has finished start-up
>   -- Defined-By: systemd
>   -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
>   --
>   -- Unit ctdb.service has finished starting up.
>   --
>   -- The start-up result is done.
>
> The relevant part of the log while stopping seems to be:
>
>   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743
> (ctdb-eventd).
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file
> /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited,
> code=exited, status=0/SUCCESS
>   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
>   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received
> EPOLLHUP on stored fd 18 (stored), closing.
>   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860
> (ctdb).
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Control process exited,
> code=exited status=0
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for
> state stop.
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Changed stop ->
> stop-sigterm
>   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=56 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
>   Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1/unit/ctdb_2eservice
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=57 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
>   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710834
> (ctdb_lock_helpe).
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710834 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710834 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710835
> (ctdb_lock_helpe).
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710832 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710832 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710835 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710835 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710833 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710833 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710836 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710836 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710837 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710837 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710838 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710838 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Child 710839 (ctdb_lock_helpe) died
> (code=exited, status=0/SUCCESS)
>   Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710839 belongs to
> ctdb.service.
>   Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710832
> (n/a).
>   Oct 11 00:57:44 rocky1 systemd[1]: systemd-journald.service: Got
> notification message from PID 840 (WATCHDOG=1)
>   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: State 'stop-sigterm'
> timed out. Killing.
>   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Failed with result
> 'timeout'.
>   -- Subject: Unit failed
>   -- Defined-By: systemd
>   -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
>   --
>   -- The unit ctdb.service has entered the 'failed' state with result
> 'timeout'.
>   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Changed stop-sigterm ->
> failed
>   Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties
> member=PropertiesChanged cookie=58 reply_cookie=0 signature=sa{sv}as
> error-name=n/a error-message=n/a
>   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Job ctdb.service/stop
> finished, result=done
>   Oct 11 00:58:17 rocky1 systemd[1]: Stopped CTDB.
>   -- Subject: Unit ctdb.service has finished shutting down
>   -- Defined-By: systemd
>   -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
>   --
>   -- Unit ctdb.service has finished shutting down.
>   Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal
> sender=org.freedesktop.systemd1 destination=n/a
> path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager
> member=JobRemoved cookie=59 reply_cookie=0 signature=uoss error-name=n/a
> error-message=n/a
>   Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Unit entered failed
> state.
>
> It would be very useful if systemd could log what it is still waiting
> for when it times out.
>
> Note that during start and stop, CTDB runs a lot of subprocesses,
> including some that use systemctl to start and stop various services
> that it, in turn, manages.
>
> The full debug level log, after running:
>
>   systemd-analyze log-level debug
>
> is uploaded it to:
>
>   https://meltin.net/uploads/systemd/ctdb-stop.log
>
> I'm happy to reply and attach it, but it is 48KB.
>
> The only theory I can come up with is some sort of race where
> processes are created during shutdown and systemd gets confused.
>
> I see a similar thing for a much simpler service, winbind:
>
> Here is winbind.service:
>
>   [Unit]
>   Description=Samba Winbind Daemon
>   Documentation=man:winbindd(8) man:samba(7) man:smb.conf(5)
>   After=network.target nmb.service
>
>   [Service]
>   Type=notify
>   PIDFile=/var/run/winbindd.pid
>   EnvironmentFile=-/etc/sysconfig/samba
>   ExecStart=/usr/sbin/winbindd --foreground --no-process-group
> $WINBINDOPTIONS
>   ExecReload=/bin/kill -HUP $MAINPID
>   LimitCORE=infinity
>
>   [Install]
>   WantedBy=multi-user.target
>
> Yesterday I watched it do the same thing as CTDB.  I could start the
> service by hand but it would time out during stop, nearly every time,
> even though there were no relevant processes running anymore.
> winbindd sends a READY=1 notification after successfully starting.  It
> does not send STOPPING=1.  winbindd is much simpler during shutdown.
> I can get logs for this one too if necessary.
>
> Thanks for any help.
>
> peace & happiness,
> martin
>


-- 
Mantas Mikulėnas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/systemd-devel/attachments/20231011/0e168e71/attachment-0001.htm>


More information about the systemd-devel mailing list