<HTML><BODY><div><div>Thank you for the reply!<br><br>> I'm happier when it's a well defined and reproducible case<br>Agreed. It is somewhat reproducible but still not well understood: during a k8s update, some nodes get into this state with a cyclic systemd mount.<br><br>> What systemd version is it? What cgroup setup is it (legacy or hybrid)?</div><div>systemd 241 (241-23-g05e654e+)<br>+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT -GNUTLS -ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN2 -IDN +PCRE2 default-hierarchy=legacy</div><div><br>It uses the legacy setup; there is no /sys/fs/cgroup/unified/.</div><div> </div><div>> Anyway you can try tracing mounts systemwide<br>Yup, I’ve set up an audit rule on the mount() syscall and tried to reproduce it semi-manually, but no luck so far:<div># auditctl -l<br>-a always,exit -S mount</div><div> </div>I’m going to update another cluster with audit enabled, so I can catch who performs this mount.<br><br><br><br>> It doesn't mean that the mount was done within the container<br>Yup, however such a transient .mount unit appears only inside the systemd-nspawn container, which runs its own systemd.<br>The machine’s main systemd has no such transient .mount unit.<br>It does not prove that the container or its systemd does the cyclic mount, but it moves my suspicion there for lack of other clues (probably wrongly).</div><div><br><br>> how was systemd-nspawn instructed to realize mounts for the container<br>Is that defined somewhere in the systemd-nspawn source or in some config files?<br><br><br>> possibly after daemon-reload</div><div>Yup, I did a daemon-reload in the outer systemd, but I’m not sure it was done in the inner one.<br><br><br>> Is there the conflicting cgroup driver used again?</div><div>Unfortunately, yes. 
We have been using the cgroupfs driver widely and for a long time.<br>We plan to migrate away from it as soon as possible.<br>I’m also thinking of proposing a PR that prevents kubelet from running with the cgroupfs driver on a systemd machine, using a check similar to this one:<br><a href="https://github.com/opencontainers/runc/blob/27227a9358b54c253e3dad85cfe532a256b88e00/libcontainer/cgroups/systemd/common.go#L49">https://github.com/opencontainers/runc/blob/27227a9358b54c253e3dad85cfe532a256b88e00/libcontainer/cgroups/systemd/common.go#L49</a></div><div>But it seems the k8s folks are not very interested in it.</div><div> </div><blockquote style="border-left:1px solid #0857A6; margin:10px; padding:0 0 0 10px;">Tuesday, November 24, 2020 4:40 AM +09:00 from Michal Koutný &lt;mkoutny@suse.com&gt;:<br> <div id=""><div class="js-helper js-readmsg-msg"><div><div id="style_16061604280035424037_BODY"><br>On Thu, Nov 19, 2020 at 10:14:18PM +0300, Andrei Enshin &lt;b1os@bk.ru&gt; wrote:<br>> For you it might be interesting in sake of improving robustness of<br>> systemd in case of such invaders as kubelet+cgroupfs : )<br>I think the interface is clearly defined in the CGROUP_DELEGATION<br>document though.<br>I'm happy if a bug can be found in general. I'm happier when it's a well<br>defined and reproducible case.<br><br>> ########## (1) abandoned cgroup ##########<br>> > systemd isn't aware of it and it would clean the hierarchy according to its configuration<br>That was related to a controller hierarchy (which I understood was the<br>k8s issue about).<br><br>Below it is a named hierarchy there it's yet different.<br><br>> systemd hasn’t deleted the unknown hierarchy, it’s still presented:<br>> [...]<br>> cgroup.procs here and in its child cgroup 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 are empty.<br>> Seems there are no processes attached to these cgroups. Date of creation is Jul 16-17.<br>What systemd version is it? 
What cgroup setup is it (legacy or hybrid)?<br><br><br>> ########## (2) mysterious mount of systemd hierarchy ##########<br>> [...]<br>> Seems to be cyclic mount. Questions are who, why and when did the second mysterious mount?<br>> I have two candidates:<br>> - runc during container creation;<br>> - systemd, probably because it was confused by kubelet and it’s unexpected usage of cgroups.<br>I don't see why/how would systemd (PID 1) do this (not sure about<br>nspawn). Anyway you can try tracing mounts systemwide (e.g. `perf trace<br>-a -e syscalls:sys_enter_mount`) to find out who does the mount.<br><br>> ########## (3) suspected owner of mysterious mount is systemd-nspawn machine ##########<br>> [...]<br>> Let’s explore cgroups of centos75 machine:<br>> # ls -lah /sys/fs/cgroup/systemd/machine.slice/systemd-nspawn\@centos75.service/payload/system.slice/ | grep sys-fs-cgroup-systemd<br>><br>> drwxr-xr-x. 2 root root 0 Nov 9 20:07 host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount<br>><br>> drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-sys-fs-cgroup-systemd.mount<br>><br>> drwxr-xr-x. 2 root root 0 Jul 16 08:05 host\x2drootfs-var-lib-machines-centos75-sys-fs-cgroup-systemd.mount<br>> There are three interesting cgroups in container. First one seems to be in relation with the abandoned cgroup and mysterious mount on the host.<br>Note those are cgroups created for .mount units (and under nested<br>payload's system.slice). It tells that within the container a mount<br>point at<br>> host/rootfs/sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a/fa85/4b01/8023/69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15<br>was visible. 
It doesn't mean that the mount was done within the<br>container.<br><br>I can't tell why was that, it depends how was systemd-nspawn instructed<br>to realize mounts for the container.<br><br>> Creation date is Nov 9 20:07. I’ve updated kubelet at Nov 8 12:01. Coincidence?! I don't think so.<br>Yes, it can be related. For instance:<br>- The cyclic bind mount happened,<br>- its visibility was propagated into the nspawn container<br>- and inner systemd created cgroup for the (generated) .mount unit<br> (possibly after daemon-reload).<br><br>> Q1. Let me ask, what is the meaning of mount inside centos75 container?<br>> /system.slice/host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount<br>><br>> Q2. Why the mount appeared in the container at Nov 9, 20:07 ?<br>Hopefully, it's answered above.<br><br>> ##### mind-blowing but might be important note #####<br>> [...]<br>> The node already seems to have not healthy mounts:<br>Is there the conflicting cgroup driver used again?<br><br>> # cat /proc/self/mountinfo |grep systemd | grep cgr<br>> 26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd<br>> 866 865 0:23 / /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd<br>> 5253 26 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup 
cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd<br>> 5251 866 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd<br>> Also seems systemd-nspawn is not affected yet, since there is no such cgroup inside centos75 container (we have it on each machine) but only abandoned one, with empty cgroup.procs:<br>It'd depend on the mounts propagation into that container and what<br>systemd inside that container did (i.e. the mount unit may not have been<br>created yet).<br><br>Michal<br> </div></div></div></div></blockquote> <div> </div><div data-signature-widget="container"><div data-signature-widget="content"><p>---</p><p>Best Regards,<br>Andrei Enshin</p></div></div><div> </div></div></BODY></HTML>