<div dir="ltr">Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I recall it was a bit complex for systemd to get notified when the cgroup actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies a helper executable that the kernel runs... I wonder if that mechanism is broken on your system.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke <<a href="mailto:martin@meltin.net">martin@meltin.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I'm seeing "systemctl stop <service>" for several services taking a<br>
long time because it goes through the timeout process, even though all<br>
relevant processes have exited.<br>
<br>
I'll give 2 examples. Both examples are running inside a privileged<br>
Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host. The<br>
systemd version, reported by "systemctl --version" in the container<br>
is:<br>
<br>
systemd 239 (239-74.el8_8.5)<br>
<br>
Here is ctdb.system:<br>
<br>
[Unit]<br>
Description=CTDB<br>
Documentation=man:ctdbd(1) man:ctdb(7)<br>
After=network-online.target time-sync.target<br>
ConditionFileNotEmpty=/etc/ctdb/nodes<br>
<br>
[Service]<br>
Type=forking<br>
LimitCORE=infinity<br>
LimitNOFILE=1048576<br>
TasksMax=4096<br>
PIDFile=/var/run/ctdb/ctdbd.pid<br>
ExecStart=/usr/sbin/ctdbd<br>
ExecStop=/usr/bin/ctdb shutdown<br>
KillMode=control-group<br>
Restart=no<br>
<br>
[Install]<br>
WantedBy=multi-user.target<br>
<br>
"/usr/bin/ctdb shutdown" causes a controlled shutdown. In many cases,<br>
starting and then stopping using systemctl works fine. However, many<br>
times it takes >90s to stop, as per TimeoutStopSec. If I reduce that<br>
value then the duration reduces accordingly. I can confirm using both<br>
"ps auxfww" and "systemd-cgls" that within the container there are no<br>
relevant processes a moment after "systemctl stop ctdb" is run. In<br>
particular, in systemd-cgls ctdb.service is gone but "systemctl stop<br>
ctdb" is still waiting.<br>
<br>
Before attempting to stop, the service is successfully started:<br>
<br>
Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing: /usr/sbin/ctdbd<br>
Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location file:/var/log/log.ctdb<br>
Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741 (ctdbd).<br>
Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to ctdb.service.<br>
Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited, code=exited status=0<br>
Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for state start.<br>
Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742 belongs to service, we are happy.<br>
Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742<br>
Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running<br>
Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start finished, result=done<br>
Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.<br>
-- Subject: Unit ctdb.service has finished start-up<br>
-- Defined-By: systemd<br>
-- Support: <a href="https://lists.freedesktop.org/mailman/listinfo/systemd-devel" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/systemd-devel</a><br>
-- <br>
-- Unit ctdb.service has finished starting up.<br>
-- <br>
-- The start-up result is done.<br>
<br>
The relevant part of the log while stopping seems to be:<br>
<br>
Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743 (ctdb-eventd).<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited, code=exited, status=0/SUCCESS<br>
Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received EPOLLHUP on stored fd 18 (stored), closing.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860 (ctdb).<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Control process exited, code=exited status=0<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for state stop.<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Changed stop -> stop-sigterm<br>
Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=56 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1/unit/ctdb_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=57 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710834 (ctdb_lock_helpe).<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710834 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710834 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710835 (ctdb_lock_helpe).<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710832 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710832 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710835 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710835 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710833 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710833 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710836 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710836 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710837 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710837 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710838 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710838 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Child 710839 (ctdb_lock_helpe) died (code=exited, status=0/SUCCESS)<br>
Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710839 belongs to ctdb.service.<br>
Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710832 (n/a).<br>
Oct 11 00:57:44 rocky1 systemd[1]: systemd-journald.service: Got notification message from PID 840 (WATCHDOG=1)<br>
Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: State 'stop-sigterm' timed out. Killing.<br>
Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Failed with result 'timeout'.<br>
-- Subject: Unit failed<br>
-- Defined-By: systemd<br>
-- Support: <a href="https://lists.freedesktop.org/mailman/listinfo/systemd-devel" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/systemd-devel</a><br>
-- <br>
-- The unit ctdb.service has entered the 'failed' state with result 'timeout'.<br>
Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Changed stop-sigterm -> failed<br>
Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=58 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Job ctdb.service/stop finished, result=done<br>
Oct 11 00:58:17 rocky1 systemd[1]: Stopped CTDB.<br>
-- Subject: Unit ctdb.service has finished shutting down<br>
-- Defined-By: systemd<br>
-- Support: <a href="https://lists.freedesktop.org/mailman/listinfo/systemd-devel" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/systemd-devel</a><br>
-- <br>
-- Unit ctdb.service has finished shutting down.<br>
Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal sender=org.freedesktop.systemd1 destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=59 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a<br>
Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Unit entered failed state.<br>
<br>
It would be very useful if systemd could log what it is still waiting<br>
for when it times out.<br>
<br>
Note that during start and stop, CTDB runs a lot of subprocesses,<br>
including some that use systemctl to start and stop various services<br>
that it, in turn, manages.<br>
<br>
The full debug level log, after running:<br>
<br>
systemd-analyze log-level debug<br>
<br>
is uploaded it to:<br>
<br>
<a href="https://meltin.net/uploads/systemd/ctdb-stop.log" rel="noreferrer" target="_blank">https://meltin.net/uploads/systemd/ctdb-stop.log</a><br>
<br>
I'm happy to reply and attach it, but it is 48KB.<br>
<br>
The only theory I can come up with is some sort of race where<br>
processes are created during shutdown and systemd gets confused.<br>
<br>
I see a similar thing for a much simpler service, winbind:<br>
<br>
Here is winbind.service:<br>
<br>
[Unit]<br>
Description=Samba Winbind Daemon<br>
Documentation=man:winbindd(8) man:samba(7) man:smb.conf(5)<br>
After=network.target nmb.service<br>
<br>
[Service]<br>
Type=notify<br>
PIDFile=/var/run/winbindd.pid<br>
EnvironmentFile=-/etc/sysconfig/samba<br>
ExecStart=/usr/sbin/winbindd --foreground --no-process-group $WINBINDOPTIONS<br>
ExecReload=/bin/kill -HUP $MAINPID<br>
LimitCORE=infinity<br>
<br>
[Install]<br>
WantedBy=multi-user.target<br>
<br>
Yesterday I watched it do the same thing as CTDB. I could start the<br>
service by hand but it would time out during stop, nearly every time,<br>
even though there were no relevant processes running anymore.<br>
winbindd sends a READY=1 notification after successfully starting. It<br>
does not send STOPPING=1. winbindd is much simpler during shutdown.<br>
I can get logs for this one too if necessary.<br>
<br>
Thanks for any help.<br>
<br>
peace & happiness,<br>
martin<br>
</blockquote></div><br clear="all"><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr">Mantas Mikulėnas</div></div>