<div dir="auto">What value do you have in /sys/fs/cgroup/systemd/release_agent, as seen by systemd? Does it point to the correct executable?<div dir="auto"><br></div><div dir="auto">Does e.g. forkstat (or execsnoop or similar) show that the executable is being run when the cgroup empties?</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 13, 2023, 04:20 Martin Schwenke <<a href="mailto:martin@meltin.net">martin@meltin.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi Mantas,<br>
<br>
Yes, it looks like cgroups v1.<br>
<br>
Would this be a kernel bug? systemd bug?<br>
<br>
Thanks...<br>
<br>
peace & happiness,<br>
martin<br>
<br>
On Wed, 11 Oct 2023 08:19:59 +0300, Mantas Mikulėnas<br>
<<a href="mailto:grawity@gmail.com" target="_blank" rel="noreferrer">grawity@gmail.com</a>> wrote:<br>
<br>
> Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I<br>
> recall it was a bit complex for systemd to get notified when the cgroup<br>
> actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies<br>
> a helper executable that the kernel runs... I wonder if that mechanism is<br>
> broken on your system.<br>
> <br>
> On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke <<a href="mailto:martin@meltin.net" target="_blank" rel="noreferrer">martin@meltin.net</a>> wrote:<br>
> <br>
> > I'm seeing "systemctl stop <service>" for several services taking a<br>
> > long time because it goes through the timeout process, even though all<br>
> > relevant processes have exited.<br>
> ><br>
> > I'll give 2 examples. Both examples are running inside a privileged<br>
> > Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host. The<br>
> > systemd version, reported by "systemctl --version" in the container<br>
> > is:<br>
> ><br>
> > systemd 239 (239-74.el8_8.5)<br>
> ><br>
> > Here is ctdb.system:<br>
> ><br>
> > [Unit]<br>
> > Description=CTDB<br>
> > Documentation=man:ctdbd(1) man:ctdb(7)<br>
> > After=network-online.target time-sync.target<br>
> > ConditionFileNotEmpty=/etc/ctdb/nodes<br>
> ><br>
> > [Service]<br>
> > Type=forking<br>
> > LimitCORE=infinity<br>
> > LimitNOFILE=1048576<br>
> > TasksMax=4096<br>
> > PIDFile=/var/run/ctdb/ctdbd.pid<br>
> > ExecStart=/usr/sbin/ctdbd<br>
> > ExecStop=/usr/bin/ctdb shutdown<br>
> > KillMode=control-group<br>
> > Restart=no<br>
> ><br>
> > [Install]<br>
> > WantedBy=multi-user.target<br>
> ><br>
> > "/usr/bin/ctdb shutdown" causes a controlled shutdown. In many cases,<br>
> > starting and then stopping using systemctl works fine. However, many<br>
> > times it takes >90s to stop, as per TimeoutStopSec. If I reduce that<br>
> > value then the duration reduces accordingly. I can confirm using both<br>
> > "ps auxfww" and "systemd-cgls" that within the container there are no<br>
> > relevant processes a moment after "systemctl stop ctdb" is run. In<br>
> > particular, in systemd-cgls ctdb.service is gone but "systemctl stop<br>
> > ctdb" is still waiting.<br>
> ><br>
> > Before attempting to stop, the service is successfully started:<br>
> ><br>
> > Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing:<br>
> > /usr/sbin/ctdbd<br>
> > Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location<br>
> > file:/var/log/log.ctdb<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741<br>
> > (ctdbd).<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited,<br>
> > code=exited status=0<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for<br>
> > state start.<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742<br>
> > belongs to service, we are happy.<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start<br>
> > finished, result=done<br>
> > Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB.<br>
> > -- Subject: Unit ctdb.service has finished start-up<br>
> > -- Defined-By: systemd<br>
> > -- Support: <a href="https://lists.freedesktop.org/mailman/listinfo/systemd-devel" rel="noreferrer noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/systemd-devel</a><br>
> > --<br>
> > -- Unit ctdb.service has finished starting up.<br>
> > --<br>
> > -- The start-up result is done.<br>
> ><br>
> > The relevant part of the log while stopping seems to be:<br>
> ><br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743<br>
> > (ctdb-eventd).<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file<br>
> > /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited,<br>
> > code=exited, status=0/SUCCESS<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal<br>
> > sender=org.freedesktop.systemd1 destination=n/a<br>
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice<br>
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged<br>
> > cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal<br>
> > sender=org.freedesktop.systemd1 destination=n/a<br>
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice<br>
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged<br>
> > cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received<br>
> > EPOLLHUP on stored fd 18 (stored), closing.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860<br>
> > (ctdb).<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Control process exited,<br>
> > code=exited status=0<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for<br>
> > state stop.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Changed stop -><br>
> > stop-sigterm<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal<br>
> > sender=org.freedesktop.systemd1 destination=n/a<br>
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice<br>
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged<br>
> > cookie=56 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal<br>
> > sender=org.freedesktop.systemd1 destination=n/a<br>
> > path=/org/freedesktop/systemd1/unit/ctdb_2eservice<br>
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged<br>
> > cookie=57 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710834<br>
> > (ctdb_lock_helpe).<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710834 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710834 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710835<br>
> > (ctdb_lock_helpe).<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710832 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710832 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710835 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710835 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710833 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710833 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710836 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710836 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710837 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710837 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710838 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710838 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Child 710839 (ctdb_lock_helpe) died<br>
> > (code=exited, status=0/SUCCESS)<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710839 belongs to<br>
> > ctdb.service.<br>
> > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710832<br>
> > (n/a).<br>
> > Oct 11 00:57:44 rocky1 systemd[1]: systemd-journald.service: Got<br>
> > notification message from PID 840 (WATCHDOG=1)<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: State 'stop-sigterm'<br>
> > timed out. Killing.<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Failed with result<br>
> > 'timeout'.<br>
> > -- Subject: Unit failed<br>
> > -- Defined-By: systemd<br>
> > -- Support: <a href="https://lists.freedesktop.org/mailman/listinfo/systemd-devel" rel="noreferrer noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/systemd-devel</a><br>
> > --<br>
> > -- The unit ctdb.service has entered the 'failed' state with result<br>
> > 'timeout'.<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Changed stop-sigterm -><br>
> > failed<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal<br>
> > sender=org.freedesktop.systemd1 destination=n/a<br>
> > path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties<br>
> > member=PropertiesChanged cookie=58 reply_cookie=0 signature=sa{sv}as<br>
> > error-name=n/a error-message=n/a<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Job ctdb.service/stop<br>
> > finished, result=done<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: Stopped CTDB.<br>
> > -- Subject: Unit ctdb.service has finished shutting down<br>
> > -- Defined-By: systemd<br>
> > -- Support: <a href="https://lists.freedesktop.org/mailman/listinfo/systemd-devel" rel="noreferrer noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/systemd-devel</a><br>
> > --<br>
> > -- Unit ctdb.service has finished shutting down.<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal<br>
> > sender=org.freedesktop.systemd1 destination=n/a<br>
> > path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager<br>
> > member=JobRemoved cookie=59 reply_cookie=0 signature=uoss error-name=n/a<br>
> > error-message=n/a<br>
> > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Unit entered failed<br>
> > state.<br>
> ><br>
> > It would be very useful if systemd could log what it is still waiting<br>
> > for when it times out.<br>
> ><br>
> > Note that during start and stop, CTDB runs a lot of subprocesses,<br>
> > including some that use systemctl to start and stop various services<br>
> > that it, in turn, manages.<br>
> ><br>
> > The full debug level log, after running:<br>
> ><br>
> > systemd-analyze log-level debug<br>
> ><br>
> > is uploaded it to:<br>
> ><br>
> > <a href="https://meltin.net/uploads/systemd/ctdb-stop.log" rel="noreferrer noreferrer" target="_blank">https://meltin.net/uploads/systemd/ctdb-stop.log</a><br>
> ><br>
> > I'm happy to reply and attach it, but it is 48KB.<br>
> ><br>
> > The only theory I can come up with is some sort of race where<br>
> > processes are created during shutdown and systemd gets confused.<br>
> ><br>
> > I see a similar thing for a much simpler service, winbind:<br>
> ><br>
> > Here is winbind.service:<br>
> ><br>
> > [Unit]<br>
> > Description=Samba Winbind Daemon<br>
> > Documentation=man:winbindd(8) man:samba(7) man:smb.conf(5)<br>
> > After=network.target nmb.service<br>
> ><br>
> > [Service]<br>
> > Type=notify<br>
> > PIDFile=/var/run/winbindd.pid<br>
> > EnvironmentFile=-/etc/sysconfig/samba<br>
> > ExecStart=/usr/sbin/winbindd --foreground --no-process-group<br>
> > $WINBINDOPTIONS<br>
> > ExecReload=/bin/kill -HUP $MAINPID<br>
> > LimitCORE=infinity<br>
> ><br>
> > [Install]<br>
> > WantedBy=multi-user.target<br>
> ><br>
> > Yesterday I watched it do the same thing as CTDB. I could start the<br>
> > service by hand but it would time out during stop, nearly every time,<br>
> > even though there were no relevant processes running anymore.<br>
> > winbindd sends a READY=1 notification after successfully starting. It<br>
> > does not send STOPPING=1. winbindd is much simpler during shutdown.<br>
> > I can get logs for this one too if necessary.<br>
> ><br>
> > Thanks for any help.<br>
> ><br>
> > peace & happiness,<br>
> > martin<br>
> > <br>
> <br>
> <br>
> -- <br>
> Mantas Mikulėnas<br>
<br>
<br>
</blockquote></div>