<div dir="ltr"><div>Hi Lennart,</div><div><br></div><div>Thanks for responding to the questions. I realise some of them may have been a little unclear in isolation - my intention was for the two posts I linked to provide the full context, but I understand they contain a lot of text that it's unreasonable to expect people to have time to read! I'll try to clarify each question below.</div><div><br></div><div>> > - Why are private cgroups mounted read-only in non-privileged containers?<br>><br>> "private cgroups"? What do you mean by that? The controllers?<br>><br>> Controller delegation on cgroupsv1 is simply not safe, that's all. You can provide invalid configuration to the kernel, and DoS the machine through it. cgroups are simply not a suitable privilege boundary on cgroupsv1.<br>><br>> If you want safe delegation, use cgroupsv2, where delegation is safe.<br></div><div><br></div><div>I was referring to the behaviour of '--cgroupns=private' (passed to 'docker run' or 'podman run'), where a cgroup namespace is created for the container. This flag exists under both v1 and v2 cgroups. For example, on v2 cgroups the host cgroup path '/sys/fs/cgroup/docker/&lt;ctr&gt;/' would correspond to '/sys/fs/cgroup/' inside the container. This is discussed further at <a href="https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-namespace-options">https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-namespace-options</a>.</div><div><br></div><div>The question was "why is the cgroupfs mounted read-only inside a non-privileged container?" - when there's a cgroup namespace, it seems it should be safe [under v2 cgroups] for the container to have write access to its cgroupfs?</div><div><br></div><div>The reason for caring about this is, of course, that it's a requirement for running systemd inside the container. Currently workarounds are required, such as '-v /sys/fs/cgroup:/sys/fs/cgroup' (which cannot be expected to work with a cgroup namespace!) 
or podman's default '--systemd=true' behaviour of detecting whether systemd is the entrypoint when deciding whether to make the cgroupfs writable. However, I'm trying to understand if there's any good reason for docker/podman not making the container's cgroupfs read-write by default.</div><div><br></div><div>> > - Is it sound to override Docker’s mounting of the private container cgroups under v1?<br>><br>> I don't know what Docker does these days, but they used to be entirely ignorant towards safe cooperation in the cgroup tree. i.e. they ignored <a href="https://systemd.io/CGROUP_DELEGATION" rel="noreferrer" target="_blank">https://systemd.io/CGROUP_DELEGATION</a> in its entirety, as they don't really accepted systemd's existance.</div><div>><br>> Today most distros I think switched over to other ways to run containers, i.e. podman and so on, which have a more professional approach to all this, and can safely cooperate in a cgroup tree.<br></div><div><br></div><div>This question does actually apply to podman too. It might be more appropriately aimed at docker/podman rather than systemd; I was just wondering if anyone here had thoughts.</div><div><br></div><div>To rephrase and provide some more context: I have a use-case where a custom bash script is our container entrypoint. The purpose of the script is to check a few things while still being able to exit the container, and at the end of the script systemd is started (with 'exec /sbin/init'). Since systemd requires write access to the cgroupfs, I was wondering if we could just unmount and recreate the cgroup mount(s) as read-write in this entrypoint script (requiring CAP_SYS_ADMIN to do so, of course), overriding the container manager's setup of making the mounts read-only.</div><div><br></div><div>> > - What are the concerns around the approach of passing '-v /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its cgroups?</div>><br>> I don't know what this does. 
Is this a Docker thing?<div><br></div><div>It's a workaround suggested for getting systemd running inside a docker container: it overrides docker's behaviour of making the cgroupfs mounts read-only, so that they are available read-write. There are some references at <a href="https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#systemd-inside-docker-containers">https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#systemd-inside-docker-containers</a>.</div><div><br></div><div>This workaround seems quite undesirable to me, considering it gives full access to the host's cgroupfs and breaks '--cgroupns=private'. It is not needed with podman since '--systemd=always' can be used. The motivation for the previous question was to remove the need for this docker workaround.</div><div><br></div><div>> > - Is modifying/replacing the cgroup mounts set up by the container engine a reasonable workaround, or could this be fragile?<br>><br>> I am not sure I follow? A workaround for what? One shouldn't assume one even has the privs to modify cgroup mounts.<br>><br>> But why would one even?<br></div><div><br></div><div>Hopefully my explanation above makes this clearer. Replacing the cgroup mounts set up by the container manager before exec-ing systemd is one possible workaround for the fact that docker creates the cgroup mounts read-only. 
As I understand it, systemd requires CAP_SYS_ADMIN anyway, and this gives us the privileges required to modify (or unmount and recreate) the cgroup mounts.</div><div><br></div><div>> > - When is it valid to manually manipulate container cgroups?</div>><br>> When you asked for your own delegated subtree first, see docs:<div>> <a href="https://systemd.io/CGROUP_DELEGATION" rel="noreferrer" target="_blank">https://systemd.io/CGROUP_DELEGATION</a></div><div><br></div><div>Yep, I have read that multiple times. The following questions elaborate on whether container managers consider the container cgroups 'delegated' from their perspective, and whether they're correctly using systemd delegation. I realise this is probably more of a question for docker/podman.<div><div><br></div><div>> > - Do container managers such as Docker and Podman correctly delegate cgroups on hosts running Systemd?<br>><br>> podman probably does this correctly. docker didn't do, not sure if that changed.<br></div></div><div><br></div><div>My guess is that this might relate to the container's 'cgroup manager/driver', which corresponds to podman's '--cgroup-manager=systemd' arg, discussed at <a href="https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-driver-options">https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-driver-options</a>. If so, I believe docker has switched back to 'systemd' being the default under cgroups v2.</div><div><br></div><div>> > - Are these container managers happy for the container to take ownership of the container’s cgroup?<br>><br>> I am not sure I grok this question, but a correctly implemented container manager should be able to safely run cgroups-using payloads inside the container. 
In that model, a host systemd manages the root of the tree, the container manager a cgroup further down, and the payload of the container (for example another systemd run inside the container) the stuff below.<br><br></div><div>You have answered my question, at least from the theoretical side, thanks; this answer was what I had expected.</div><div><br></div><div>> > - Why are the container’s cgroup limits not set on a parent cgroup under Docker/Podman?<br>><br>> I don't grok the question?<br></div><div><div>><br>> > - Why doesn’t Docker use another layer of indirection in the cgroup hierarchy such that the limit is applied in the parent cgroup to the container?<br>><br>> I don't understand the question. And I can't answer docker questions.<br></div><div><br></div></div><div>This is explained at <a href="https://www.lewisgaul.co.uk/blog/coding/rough/2022/05/20/cgroups-questions/#why-are-the-containers-cgroup-limits-not-set-on-a-parent-cgroup-under-dockerpodman">https://www.lewisgaul.co.uk/blog/coding/rough/2022/05/20/cgroups-questions/#why-are-the-containers-cgroup-limits-not-set-on-a-parent-cgroup-under-dockerpodman</a>. I'm basically questioning why a cgroup limit requested by e.g. 'docker run --memory=20000000' is set on a cgroup that is made available in (or delegated to) the container, such that the container is able to modify its own limit (if it has write access). It feels like there's a missing cgroup layer in this setup. If others agree with this assessment then I would be happy to bring it up on the docker/podman issue trackers.</div><div><br></div><div>> > - What happens if you have two of the same cgroup mount?<br>><br>> what do you mean by a "cgroup mount"? A cgroupfs controller mount? If they are within the same cgroup namespace they will be effectively bind mounts of each other, i.e. show the exact same contents. 
<br></div><div><br></div><div>Yes, that's what I meant, and this confirms what I believed to be the case, thanks.</div><div><br></div><div>> > - Are there any gotchas/concerns around manipulating cgroups via multiple mount points?<br>><br>> Why would you do that though?<br></div><div><br></div><div>I'm not sure; I'm just trying to better understand how cgroups work and what's going on when creating/manipulating cgroup mounts.</div><div><br></div><div>> > - What’s the correct way to check which controllers are enabled?<br>> <br>> enabled *in* *what*? in the kernel? /proc/cgroups. Mounted? "mount" maybe? in your container mgr? depends on that.<br>></div><div>> > - What is it that determines which controllers are enabled? Is it kernel configuration applied at boot?<br>><br>> Enabled where?<br><br>I meant in the kernel, i.e. which controllers it's possible to create mounts for and use.<br><br>> > - Is it possible to have some controllers enabled for v1 at the same time as others are enabled for v2?<br>><br>> Yes.<br></div><div><br></div><div>Ah ok, that's interesting. So it's not always possible to say "the host's active cgroup version is {1,2}"; instead it would have to be stated on a per-controller basis, such as "the cgroup memory controller is enabled on version {1,2}"? 
In practice, is this likely to be a case that's encountered in the wild [on a host running systemd]?</div><div><br></div><div>Thanks,</div><div>Lewis</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, 21 May 2022 at 08:48, Lennart Poettering &lt;<a href="mailto:lennart@poettering.net">lennart@poettering.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fr, 20.05.22 17:12, Lewis Gaul (<a href="mailto:lewis.gaul@gmail.com" target="_blank">lewis.gaul@gmail.com</a>) wrote:<br> <br> > To summarize the questions (taken from the second post linked above):<br> > - Why are private cgroups mounted read-only in non-privileged<br> > containers?<br> <br> "private cgroups"? What do you mean by that? The controllers?<br> <br> Controller delegation on cgroupsv1 is simply not safe, that's all. You<br> can provide invalid configuration to the kernel, and DoS the machine<br> through it. cgroups are simply not a suitable privilege boundary on<br> cgroupsv1.<br> <br> If you want safe delegation, use cgroupsv2, where delegation is safe.<br> <br> > - Is it sound to override Docker’s mounting of the private container<br> > cgroups under v1?<br> <br> I don't know what Docker does these days, but they used to be entirely<br> ignorant towards safe cooperation in the cgroup tree. i.e. they<br> ignored <a href="https://systemd.io/CGROUP_DELEGATION" rel="noreferrer" target="_blank">https://systemd.io/CGROUP_DELEGATION</a> in its entirety, as they<br> don't really accepted systemd's existance.<br> <br> Today most distros I think switched over to other ways to run<br> containers, i.e. 
podman and so on, which have a more professional<br> approach to all this, and can safely cooperate in a cgroup tree.<br> <br> > - What are the concerns around the approach of passing '-v<br> > /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its<br> > cgroups?<br> <br> I don't know what this does. Is this a Docker thing?<br> <br> > - Is modifying/replacing the cgroup mounts set up by the container engine<br> > a reasonable workaround, or could this be fragile?<br> <br> I am not sure I follow? A workaround for what? One shouldn't assume<br> one even has the privs to modify cgroup mounts.<br> <br> But why would one even?<br> <br> > - When is it valid to manually manipulate container cgroups?<br> <br> When you asked for your own delegated subtree first, see docs:<br> <br> <a href="https://systemd.io/CGROUP_DELEGATION" rel="noreferrer" target="_blank">https://systemd.io/CGROUP_DELEGATION</a><br> <br> > - Do container managers such as Docker and Podman correctly delegate<br> > cgroups on hosts running Systemd?<br> <br> podman probably does this correctly. docker didn't do, not sure if<br> that changed.<br> <br> > - Are these container managers happy for the container to take ownership<br> > of the container’s cgroup?<br> <br> I am not sure I grok this question, but a correctly implemented<br> container manager should be able to safely run cgroups-using payloads<br> inside the container. In that model, a host systemd manages the root<br> of the tree, the container manager a cgroup further down, and the<br> payload of the container (for example another systemd run inside the<br> container) the stuff below.<br> <br> > - Why are the container’s cgroup limits not set on a parent cgroup under<br> > Docker/Podman?<br> <br> I don't grok the question?<br> <br> > - Why doesn’t Docker use another layer of indirection in the cgroup<br> > hierarchy such that the limit is applied in the parent cgroup to the<br> > container?<br> <br> I don't understand the question. 
And I can't answer docker questions.<br> <br> > - What happens if you have two of the same cgroup mount?<br> <br> what do you mean by a "cgroup mount"? A cgroupfs controller mount? If<br> they are within the same cgroup namespace they will be effectively<br> bind mounts of each other, i.e. show the exact same contents.<br> <br> > - Are there any gotchas/concerns around manipulating cgroups via multiple<br> > mount points?<br> <br> Why would you do that though?<br> <br> > - What’s the correct way to check which controllers are enabled?<br> <br> enabled *in* *what*? in the kernel? /proc/cgroups. Mounted? "mount"<br> maybe? in your container mgr? depends on that.<br> <br> > - What is it that determines which controllers are enabled? Is it kernel<br> > configuration applied at boot?<br> <br> Enabled where?<br> <br> > - Is it possible to have some controllers enabled for v1 at the same time<br> > as others are enabled for v2?<br> <br> Yes.<br> <br> Lennart<br> <br> --<br> Lennart Poettering, Berlin<br> </blockquote></div></div></div>