[systemd-devel] name=systemd cgroup mounts/hierarchy

Andrei Enshin b1os at bk.ru
Thu Nov 19 19:14:18 UTC 2020


Thank you for the explanation!

> systemd isn't aware of it and it would clean the hierarchy according to its configuration
Yup, that is what I would expect from systemd after reading the docs: to be the full, robust and confident owner of the cgroup hierarchies on a host.
However, if you don’t mind going deeper into "undefined behavior" and digging a bit into the implementation details, let me continue exploring this particular case.

It’s interesting to me mostly for educational purposes.
For you it might be interesting for the sake of improving systemd’s robustness against such invaders as kubelet+cgroupfs :)


########## (1) abandoned cgroup ##########
> systemd isn't aware of it and it would clean the hierarchy according to its configuration
systemd hasn’t deleted the unknown hierarchy; it is still present:
# ls -lah /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/
total 0
drwxr-xr-x.   3 root root 0 Jul 16 08:06 .
drwxr-xr-x. 130 root root 0 Jul 16 08:06 ..
drwxr-xr-x.   3 root root 0 Jul 16 08:10 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
-rw-r--r--.   1 root root 0 Jul 17 06:04 cgroup.clone_children
-rw-r--r--.   1 root root 0 Jul 17 06:04 cgroup.procs
-rw-r--r--.   1 root root 0 Jul 17 06:04 notify_on_release
-rw-r--r--.   1 root root 0 Jul 17 06:04 tasks

The cgroup.procs file here and the one in its child cgroup 8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 are both empty.
It seems there are no processes attached to these cgroups. The creation dates are Jul 16-17.
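
For completeness, one way to confirm that no process is attached anywhere under the abandoned pod cgroup (just a sketch, reusing the paths shown above):
# find /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55 -name cgroup.procs -exec wc -l {} +
Every cgroup.procs should report 0 lines if the whole subtree is really empty.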

########## (2) mysterious mount of systemd hierarchy ##########
Let’s look at it from another point of view: the host mounts. We have already seen this. On the host we can see two mounts of the same hierarchy:
# cat /proc/self/mountinfo | grep cgr | grep syst

26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd

2826 26 0:23 /kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
This looks like a cyclic mount. The questions are: who created the second, mysterious mount, why, and when? (One check that might help narrow this down is sketched right after the list below.)
I have two candidates:
- runc, during container creation;
- systemd, probably because it was confused by kubelet and its unexpected usage of cgroups.
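
If I understand mount propagation correctly, both mounts belong to peer group shared:6, so a bind mount made under /sys/fs/cgroup/systemd in any mount namespace that is still a propagation peer of the host would propagate back to the host and show up exactly like mount 2826. A quick check of the propagation flags (a sketch, reusing the paths from above):
# findmnt -o TARGET,PROPAGATION,FSTYPE /sys/fs/cgroup/systemd
# findmnt -o TARGET,PROPAGATION,FSTYPE /sys/fs/cgroup/systemd/kubepods/burstable/pod7ffde41a-fa85-4b01-8023-69a4e4b50c55/8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15
If both show PROPAGATION as shared, the second mount could have been created in some peer namespace (e.g. by the container runtime) and merely propagated to the host.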

########## (3) suspected owner of mysterious mount is systemd-nspawn machine ##########
Let’s look at the situation from a third point of view: that of systemd-nspawn:
# machinectl list
MACHINE  CLASS     SERVICE        OS     VERSION ADDRESSES
centos75 container systemd-nspawn centos 7       -
frr      container systemd-nspawn ubuntu 18.04   -

2 machines listed.
Let’s explore the cgroups of the centos75 machine:
# ls -lah /sys/fs/cgroup/systemd/machine.slice/systemd-nspawn\@centos75.service/payload/system.slice/ | grep sys-fs-cgroup-systemd

drwxr-xr-x.   2 root root 0 Nov  9 20:07 host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount

drwxr-xr-x.   2 root root 0 Jul 16 08:05 host\x2drootfs-sys-fs-cgroup-systemd.mount

drwxr-xr-x.   2 root root 0 Jul 16 08:05 host\x2drootfs-var-lib-machines-centos75-sys-fs-cgroup-systemd.mount
There are three interesting cgroups in the container. The first one seems to be related to the abandoned cgroup and the mysterious mount on the host.

Its creation date is Nov  9 20:07. I updated kubelet on Nov  8 at 12:01. Coincidence?! I don't think so.
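
One way to double-check that timestamp from the container’s own side (a sketch only; it assumes systemctl -M works for this nspawn machine and that the unit name is exactly the cgroup directory name listed above):
# systemctl -M centos75 show -p Where -p What -p ActiveEnterTimestamp 'host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount'
ActiveEnterTimestamp should show when the unit last became active, i.e. roughly when the container’s systemd noticed the mount.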

##### questions #####
Unfortunately I don’t know how to check the creation date/time of the mount point (2826 26 0:23) on the host system.
Probably systemd-nspawn is disrupted by the abandoned cgroup created by kubelet.

Q1. Let me ask: what is the meaning of this mount inside the centos75 container?
/system.slice/host\x2drootfs-sys-fs-cgroup-systemd-kubepods-burstable-pod7ffde41a\x2dfa85\x2d4b01\x2d8023\x2d69a4e4b50c55-8842def241fac72cb34fdce90297b632f098289270fa92ec04643837f5748c15.mount

Q2. Why did the mount appear in the container on Nov 9 at 20:07?
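
For Q2, a possible starting point (a sketch only; it assumes the journals go back that far and that kubelet runs as kubelet.service on this host) is to see what was logged around that moment, both inside the machine and on the host:
# journalctl -M centos75 --since '2020-11-09 19:30' --until '2020-11-09 20:30' --no-pager
# journalctl --since '2020-11-09 19:30' --until '2020-11-09 20:30' -u kubelet.service -u systemd-nspawn@centos75.service --no-pager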


Understanding the logic behind this situation, even though it is obviously incorrect usage of systemd and kubelet+cgroupfs, will help us make some part(s) more robust and resistant to this kind of intervention.

##### mind-blowing but might be important note #####
Here is one node in another cluster which has not yet been updated to kubelet 1.19.2 (updating to 1.19.2 reveals the situation, since kubelet starts to crash).
It runs kubelet v1.18.6 with hyperkube inside rkt.

The node already seems to have unhealthy mounts:
 
# cat /proc/self/mountinfo |grep systemd | grep cgr
26 25 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
866 865 0:23 / /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
5253 26 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
5251 866 0:23 /kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 /var/lib/rkt/pods/run/3720606d-535b-4e59-a137-ee00246a20c1/stage1/rootfs/opt/stage2/hyperkube-amd64/rootfs/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3/4a81a28292c3250e03c27a7270cdf58a07940e462999ab3e2be51c01b3a6bf10 rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
Also, it seems systemd-nspawn is not affected yet, since there is no such cgroup inside the centos75 container (we have it on each machine), only the abandoned one with an empty cgroup.procs:
# find /sys/fs/ -name '*64ad01*'
/sys/fs/cgroup/systemd/kubepods/burstable/pod64ad01cf-5dd4-4283-abe0-8fb8f3f13dc3
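
Since this node is not affected yet, it might be possible to catch the moment the extra mount shows up and correlate it with kubelet/runc activity (a sketch; findmnt’s poll mode simply watches /proc/self/mountinfo for changes):
# findmnt --poll -o ACTION,TARGET,SOURCE | grep --line-buffered kubepods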

  
>Thursday, November 19, 2020 7:32 PM +09:00 from Michal Koutný <mkoutny at suse.com>:
> 
>Hi.
>
>On Wed, Nov 18, 2020 at 09:46:03PM +0300, Andrei Enshin < b1os at bk.ru > wrote:
>> Just out of curiosity, how systemd in particular may be disrupted with
>> such record in root of it’s cgroups hierarchy as /kubpods/bla/bla
>> during service (de)activation?
>> Or how it may disrupt the kubelet or workload running by it?
>If processes from kubeletet.service are migrated elsewhere, systemd may
>lose ability to associate it with the service (which may or may not be
>correct, I didn't check this particular case).
>
>In the opposite direction, if container runtime builds up a hierarchy
>for a controller, systemd isn't aware of it and it would clean the
>hierarchy according to its configuration (which can, for instance, be no
>controllers at all) and happens during unit (de)activation. The
>containers can get away with it when there are no unit changes at the
>moment but that's not what you want. Furthermore, since cgroup
>operations for a unit usually involve family [1], the interference may
>happen even when apparently unrelated unit changes. (This applies to the
>most common "hybrid" cgroup layout.)
>
>> Seems I missed some technical details how exact it will interfere.
>There's the defined interface (delegation or DBus API) and both parties
>(systemd, container runtimes) have freedom to implement cgroups as they
>wish within these limits.
>If they overlap though, you get an undefined behavior in principle.
>That's the reason why to stick to this convention.
>
>Michal
>
>
>[1] This is rather an implementation detail
>     https://github.com/systemd/systemd/blob/f56a9cbf9c20cd798258d3db302d51bf21458b38/src/core/cgroup.c#L2326
>
>  
 
 
---
Best Regards,
Andrei Enshin
 