Thank you for the explanation!

> systemd isn't aware of it and it would clean the hierarchy according to its configuration
Yup, that is what I would expect from systemd after reading the docs: to be the full, robust and confident owner of the cgroup hierarchies on a host.
However, if you don’t mind going deeper into the “undefined behavior” and digging a bit into the implementation details, let me continue exploring this particular case.

For me it’s interesting mostly for educational purposes.
For you it might be interesting for the sake of making systemd more robust against such invaders as kubelet+cgroupfs : )


########## (1) abandoned cgroup ##########
> systemd isn't aware of it and it would clean the hierarchy according to its configuration
systemd hasn’t deleted the unknown hierarchy; it’s still present:

# ls -lah /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/
total 0
drwxr-xr-x.   3 root root 0 Jul 16 08:06 .
drwxr-xr-x. 130 root root 0 Jul 16 08:06 ..
drwxr-xr-x.   3 root root 0 Jul 16 08:10 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
-rw-r--r--.   1 root root 0 Jul 17 06:04 cgroup.clone_children
-rw-r--r--.   1 root root 0 Jul 17 06:04 cgroup.procs
-rw-r--r--.   1 root root 0 Jul 17 06:04 notify_on_release
-rw-r--r--.   1 root root 0 Jul 17 06:04 tasks

cgroup.procs here and in the child cgroup 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 are both empty.
It seems there are no processes attached to these cgroups. The creation dates are Jul 16-17.
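(Side note: a rough sketch of how one could enumerate such leftovers, i.e. directories in the named systemd hierarchy whose cgroup.procs has no entries, assuming the cgroup v1 layout mounted at /sys/fs/cgroup/systemd as above. The files have to be read rather than stat'ed, since cgroupfs always reports a file size of 0:

# find /sys/fs/cgroup/systemd/kubepods -type d | \
    while read -r d; do grep -q . "$d/cgroup.procs" || echo "no tasks: $d"; done

Directories printed by this have no processes attached; whether they are safe to rmdir is a separate question.)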
########## (2) mysterious mount of the systemd hierarchy ##########
Let’s look at it from another point of view: the host mounts. We’ve already seen it. On the host we can see two mounts of the same hierarchy:

# cat /proc/self/mountinfo | grep cgr | grep syst

26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

2826 26 0:23 /kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

It seems to be a cyclic mount. The questions are: who created this second, mysterious mount, why, and when?
I have two candidates:
- runc, during container creation;
- systemd, probably because it was confused by kubelet and its unexpected usage of cgroups.
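As for the “who” and “when”: as far as I know the mount table itself doesn’t record when a mount was created, but the next occurrence could be caught in the act. A rough sketch with bpftrace (just an illustration, assuming bpftrace is available on the host; it only sees mount(2) calls made after it is started, but it sees them from any mount namespace):

# bpftrace -e 'tracepoint:syscalls:sys_enter_mount {
      printf("%s (pid %d): src=%s dst=%s\n",
             comm, pid, str(args->dev_name), str(args->dir_name)); }'

Something like findmnt --poll on the host would also show new entries appearing in the mount table, though without telling which process created them.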
########## (3) suspected owner of the mysterious mount is a systemd-nspawn machine ##########
Let’s look at the situation from a third point of view: systemd-nspawn.

# machinectl list
MACHINE  CLASS     SERVICE        OS     VERSION ADDRESSES
centos75 container systemd-nspawn centos 7       -
frr      container systemd-nspawn ubuntu 18.04   -

2 machines listed.

Let’s explore the cgroups of the centos75 machine:

# ls -lah /sys/fs/cgroup/systemd/machine.slice/systemd-nspawn\@centos75.service/payload/system.slice/ | grep sys-fs-cgroup-systemd

drwxr-xr-x. 2 root root 0 Nov 9 20:07 host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount

drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-sys-fs-cgroup-systemd.mount

drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-var-lib-machines-centos75-sys-fs-cgroup-systemd.mount

There are three interesting cgroups in the container. The first one seems to be related to the abandoned cgroup and the mysterious mount on the host.

Its creation date is Nov 9 20:07. I updated kubelet on Nov 8 at 12:01. Coincidence?! I don't think so.

##### questions #####
Unfortunately, I don’t know how to check the creation date/time of the mount point (2826 26 0:23) on the host system.
Probably systemd-nspawn is disrupted by the abandoned cgroup created by kubelet.

Q1. Let me ask: what is the meaning of this mount inside the centos75 container?
/system.slice/host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount

Q2. Why did the mount appear in the container on Nov 9 at 20:07?


Understanding the logic behind such a situation, even though it’s obviously wrong usage of systemd together with kubelet+cgroupfs, will help us make some part(s) more robust and resistant to this kind of intervention.

##### mind-blowing but might be important note #####
Here is a node in another cluster which has not yet been updated to kubelet 1.19.2 (the update to 1.19.2 reveals the situation, since kubelet starts to crash).
It runs kubelet v1.18.6 with hyperkube inside rkt.

The node already seems to have unhealthy mounts:

# cat /proc/self/mountinfo | grep systemd | grep cgr
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

866 865 0:23 / /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

5253 26 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

5251 866 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

Also, it seems systemd-nspawn is not affected yet, since there is no such cgroup inside the centos75 container (we have it on each machine), only an abandoned one with an empty cgroup.procs:

# find /sys/fs/ -name '*64ad01*'
/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3


Thursday, November 19, 2020 7:32 PM +09:00 from Michal Koutný <mkoutny@suse.com>:

> Hi.
>
> On Wed, Nov 18, 2020 at 09:46:03PM +0300, Andrei Enshin <b1os@bk.ru> wrote:
>> Just out of curiosity, how systemd in particular may be disrupted with
>> such record in root of it’s cgroups hierarchy as /kubpods/bla/bla
>> during service (de)activation?
>> Or how it may disrupt the kubelet or workload running by it?
> If processes from kubeletet.service are migrated elsewhere, systemd may
> lose ability to associate it with the service (which may or may not be
> correct, I didn't check this particular case).
>
> In the opposite direction, if container runtime builds up a hierarchy
> for a controller, systemd isn't aware of it and it would clean the
> hierarchy according to its configuration (which can, for instance, be no
> controllers at all) and happens during unit (de)activation. The
> containers can get away with it when there are no unit changes at the
> moment but that's not what you want. Furthermore, since cgroup
> operations for a unit usually involve family [1], the interference may
> happen even when apparently unrelated unit changes. (This applies to the
> most common "hybrid" cgroup layout.)
>
>> Seems I missed some technical details how exact it will interfere.
> There's the defined interface (delegation or DBus API) and both parties
> (systemd, container runtimes) have freedom to implement cgroups as they
> wish within these limits.
> If they overlap though, you get an undefined behavior in principle.
> That's the reason why to stick to this convention.
>
> Michal
>
> [1] This is rather an implementation detail
>     https://github.com/systemd/systemd/blob/f56a9cbf9c20cd798258d3db302d51bf21458b38/src/core/cgroup.c#L2326

---

Best Regards,
Andrei Enshin