[systemd-devel] Systemd, cgroupsv2, cgrulesengd, and nftables
Andrei Borzenkov
arvidjaar at gmail.com
Sat Jun 15 14:37:22 UTC 2024
On 15.06.2024 16:58, Mikhail Morfikov wrote:
> On 15/06/2024 2.27 pm, Andrei Borzenkov wrote:
>> On 15.06.2024 14:02, Mikhail Morfikov wrote:
>>>
>>> But there are no curl pids in /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.procs.
>>> To be more specific, there are no pids at all in this cgroup.procs file. The curl pids are under
>>>
>>> # cat /sys/fs/cgroup/morfikownia/user/curl/pids.current
>>> 1
>>>
>>> # cat /sys/fs/cgroup/morfikownia/user/curl/cgroup.procs
>>> 44907
>>>
>>> And this cgroup path (morfikownia/user/curl/) is permitted in nftables, and
>>> yet packets sometimes show up as if they had the user.slice/user-1000.slice/user@1000.service/
>>> path set. Why?
>>
>> Because curl starts in this hierarchy and attempts a network connection before your daemon moves curl into a different cgroup. This is just as good a stab in the dark as any other.
>>
>
> No, it's not like this. When curl attempts to access the internet, it sends
> a SYN packet, which is dropped in nftables because of the wrong cgroup path.
> If what you say were true, then the next (or any later) SYN packet would be
> accepted, since the pid is in the right cgroup path by then, which is permitted in
> nftables.
>
> But when I watch the nftables logs, I see something like this:
>
> Jun 15 15:30:57 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52657 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A96453BC0000000000103030E) UID=1000 GID=1000
> Jun 15 15:30:59 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52658 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A96453FCB000000000103030E) UID=1000 GID=1000
> Jun 15 15:31:00 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52659 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A964543CB000000000103030E) UID=1000 GID=1000
> Jun 15 15:31:01 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52660 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A964547CB000000000103030E) UID=1000 GID=1000
> Jun 15 15:31:02 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52661 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A96454BCB000000000103030E) UID=1000 GID=1000
> Jun 15 15:31:03 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52662 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A96454FCB000000000103030E) UID=1000 GID=1000
> Jun 15 15:31:05 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52663 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A964557CB000000000103030E) UID=1000 GID=1000
> Jun 15 15:31:09 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52664 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A9645678B000000000103030E) UID=1000 GID=1000
> Jun 15 15:31:17 morfikownia kernel: * cgroup * IN= OUT=bond0 SRC=192.168.1.150 DST=212.77.98.9 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=52665 DF PROTO=TCP SPT=41760 DPT=80 SEQ=3391855235 ACK=0 WINDOW=65535 RES=0x00 SYN URGP=0 OPT (020405B40402080A964588CB000000000103030E) UID=1000 GID=1000
>
> Pay attention to the timestamps. All the packets come from the same curl
> connection. So we have the beginning at 15:30:57 and the end at 15:31:17 (a 20s window),
> and then I pressed Ctrl+C, because it's not going to work.
>
> So the pid is for sure in the right cgroup path before the later SYN packets are sent.
> If only the very first SYN packet were dropped, that would make sense, I mean the
> theory with the app accessing the net before cgrulesengd moves the pid. But we have
> 20s, the pid is in the right cgroup, and sometimes it works and sometimes it
> doesn't, I mean curl is sometimes able to access the net and sometimes not. And that's weird.
>
Not really. nftables checks the *socket* cgroup, not the *process*
cgroup. The socket may have been created while the process was still in the old
cgroup.

I do not know whether the kernel also attempts to move all of a process's sockets
to the new cgroup. I suspect not, but that is most certainly a
question for the kernel folks.

See my other response about atomically placing a process into some
pre-existing cgroup from the very beginning.
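A minimal sketch of the "start in the right cgroup from the very beginning" idea, assuming (as suggested above, not verified here) that a socket keeps the cgroup it was created in. The launcher moves itself into the target cgroup and only then execs the real program, so every socket that program opens is created inside that cgroup. The helper names (procs_file, exec_in_cgroup, parse_cgroup_v2_path) are made up for illustration, the mount point /sys/fs/cgroup is an assumption, and the caller needs write permission on the target cgroup.procs. This is not necessarily the method described in the other response.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # assumption: unified (v2) hierarchy is mounted here


def parse_cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path found in /proc/<pid>/cgroup content, or None."""
    for line in proc_cgroup_text.splitlines():
        # The v2 entry has hierarchy ID 0 and an empty controller list: "0::/path"
        if line.startswith("0::"):
            return line[3:]
    return None


def procs_file(cgroup_path, root=CGROUP_ROOT):
    """Build the cgroup.procs path for a cgroup given relative to the root."""
    return os.path.join(root, cgroup_path.strip("/"), "cgroup.procs")


def exec_in_cgroup(cgroup_path, argv):
    """Move the current process into cgroup_path, then replace it with argv."""
    # Writing our own pid migrates this process; the exec'ed program keeps
    # the same pid and therefore starts its life inside the target cgroup.
    with open(procs_file(cgroup_path), "w") as f:
        f.write(str(os.getpid()))
    os.execvp(argv[0], argv)


# Hypothetical usage (replaces the calling process, so it is left commented out):
# exec_in_cgroup("morfikownia/user/curl", ["curl", "http://example.org/"])
```

If the assumption about sockets holds, a command started this way never has a window in which it runs, and creates sockets, under user@1000.service. parse_cgroup_v2_path can be fed the content of /proc/self/cgroup to double-check where a process currently is.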
> It looks like the cgroup path isn't updated for some reason -- that's my blind
> guess, because the pid is in the right place, the nftables rule works, and yet
> the cgroup path "internally somewhere" is user.slice/user-1000.slice/user@1000.service/
> instead of the right one, where the pid was moved. I bet there's a bug somewhere.
>