[systemd-devel] Questions around cgroups, systemd, containers

Lewis Gaul lewis.gaul at gmail.com
Sat May 21 18:02:14 UTC 2022


Hi Lennart,

Thanks for responding to the questions. I realise some of them may have
been a little unclear in isolation - my intention was for the two posts I
linked to provide the full context, but I understand they contain a lot of
text that it's unreasonable to expect people to have time to read! I'll try
to clarify for each question below.

> > - Why are private cgroups mounted read-only in non-privileged
> > containers?
>
> "private cgroups"? What do you mean by that? The controllers?
>
> Controller delegation on cgroupsv1 is simply not safe, that's all. You
> can provide invalid configuration to the kernel, and DoS the machine
> through it. cgroups are simply not a suitable privilege boundary on
> cgroupsv1.
>
> If you want safe delegation, use cgroupsv2, where delegation is safe.

I was referring to the behaviour of '--cgroupns=private' (to 'docker run'
or 'podman run'), where a cgroup namespace is created for the container.
This flag exists under both v1 and v2 cgroups. For example, on v2 cgroups
the host cgroup path '/sys/fs/cgroup/docker/<ctr>/' would correspond to
'/sys/fs/cgroup/' inside the container. Discussed more at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-namespace-options.
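
To illustrate with a rough sketch (assuming cgroups v2; '<ctr>' and
'<ctr-pid>' are placeholders for a real container name and host PID):

  # On the host, the container's processes live in the delegated cgroup:
  cat /proc/<ctr-pid>/cgroup                  # -> 0::/docker/<ctr>

  # Inside the container with --cgroupns=private, that same cgroup
  # appears as the root of the namespace:
  docker exec <ctr> cat /proc/self/cgroup     # -> 0::/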

The question was "why is the cgroupfs mounted read-only inside a
non-privileged container?" - when there's a cgroup namespace it seems it
should be safe [under v2 cgroups] for the container to have write access
to its cgroupfs?

The reason for caring about this, of course, is that it's a requirement
for running systemd inside the container. Currently, workarounds are
required, such as '-v /sys/fs/cgroup:/sys/fs/cgroup' (which cannot be
expected to work with a cgroup namespace!) or podman's default
'--systemd=true' behaviour of detecting whether systemd is the entrypoint
when deciding whether to make the cgroupfs writable. However, I'm trying
to understand whether there's any good reason for docker/podman not
making the container's cgroupfs read-write by default.

> > - Is it sound to override Docker’s mounting of the private container
> > cgroups under v1?
>
> I don't know what Docker does these days, but they used to be entirely
> ignorant towards safe cooperation in the cgroup tree. i.e. they ignored
> https://systemd.io/CGROUP_DELEGATION in its entirety, as they never
> really accepted systemd's existence.
>
> Today most distros I think switched over to other ways to run
> containers, i.e. podman and so on, which have a more professional
> approach to all this, and can safely cooperate in a cgroup tree.

This question does actually apply to podman too. It might be more
appropriately aimed at docker/podman rather than systemd; I was just
wondering if anyone had thoughts.

To rephrase/provide some more context - I have a use-case where a custom
bash script is our container entrypoint; its purpose is to check a few
things (while retaining the ability to exit the container), and at the
end of the script systemd is started (with 'exec /sbin/init'). Since
systemd requires write access to the cgroupfs, I was wondering if we
could just unmount and recreate the cgroup mount(s) as read-write in this
entrypoint script (requiring CAP_SYS_ADMIN to do so, of course),
overriding the container manager's setup of making the mounts read-only.
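
A minimal sketch of the kind of entrypoint I mean (assuming cgroups v2
and CAP_SYS_ADMIN; depending on how the engine created the mount, a bind
remount may be needed instead of the plain remount shown):

  #!/bin/bash
  # ... site-specific checks, exiting (and thereby stopping the
  # container) on failure ...

  # Make the container's cgroupfs writable again, either in place ...
  mount -o remount,rw /sys/fs/cgroup ||
  # ... or by tearing down and recreating the mount:
  { umount /sys/fs/cgroup && mount -t cgroup2 cgroup2 /sys/fs/cgroup; }

  # Hand over PID 1 to systemd.
  exec /sbin/init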

> >   - What are the concerns around the approach of passing '-v
> > /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its
> > cgroups?
>
> I don't know what this does. Is this a Docker thing?

It's a workaround suggested for getting systemd running inside a docker
container, overriding docker's behaviour of making the cgroupfs mounts
read-only to make them available read-write. There are some references at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#systemd-inside-docker-containers.

This workaround seems quite undesirable to me considering it gives full
access to the host's cgroupfs and breaks '--cgroupns=private'. This is not
needed with podman since '--systemd=always' can be used. But the motivation
of the point/question above was to remove the requirement for this docker
workaround.
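
For concreteness, the two invocations side by side (a sketch; the image
name is a hypothetical placeholder):

  # docker workaround - exposes the *host's* cgroup tree read-write and
  # defeats --cgroupns=private:
  docker run -d -v /sys/fs/cgroup:/sys/fs/cgroup my-systemd-image

  # podman equivalent - keeps the private view and makes it writable:
  podman run -d --systemd=always my-systemd-image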

> >   - Is modifying/replacing the cgroup mounts set up by the container
> > engine a reasonable workaround, or could this be fragile?
>
> I am not sure I follow? A workaround for what? One shouldn't assume one
> even has the privs to modify cgroup mounts.
>
> But why would one even?

Hopefully my explanation above makes this clearer. Replacing the cgroup
mounts set up by the container manager before exec-ing systemd is one
possible workaround for the fact that docker creates the cgroup mounts
read-only. As I understand it, systemd requires CAP_SYS_ADMIN anyway, and
this gives us the privileges required to modify (or unmount and recreate)
the cgroup mounts.

> > - When is it valid to manually manipulate container cgroups?
>
> When you asked for your own delegated subtree first, see docs:
> https://systemd.io/CGROUP_DELEGATION

Yep, I have read that multiple times; the following questions elaborate
on the point about whether container managers consider the container's
cgroups 'delegated' from their perspective, and whether they're correctly
using systemd delegation. I realise this is probably more of a question
for docker/podman.

> >   - Do container managers such as Docker and Podman correctly delegate
> > cgroups on hosts running Systemd?
>
> podman probably does this correctly. docker didn't, not sure if that
> changed.

My guess is that this might relate to the container's 'cgroup
manager/driver' corresponding to podman's '--cgroup-manager=systemd' arg,
discussed at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-driver-options.
If so, I believe docker has switched back to 'systemd' being the default
under cgroups v2.
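
One way to check what a given engine is actually using (output will of
course vary by host):

  docker info | grep -i cgroup    # e.g. "Cgroup Driver: systemd" and
                                  #      "Cgroup Version: 2"
  podman info | grep -i cgroup    # e.g. "cgroupManager: systemd"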

> >   - Are these container managers happy for the container to take
> > ownership of the container’s cgroup?
>
> I am not sure I grok this question, but a correctly implemented container
> manager should be able to safely run cgroups-using payloads inside the
> container. In that model, a host systemd manages the root of the tree, the
> container manager a cgroup further down, and the payload of the container
> (for example another systemd run inside the container) the stuff below.

You have answered my question at least from the theoretical side, thanks;
this answer was what I had expected.

> > - Why are the container’s cgroup limits not set on a parent cgroup
> > under Docker/Podman?
>
> I don't grok the question?
>
> >   - Why doesn’t Docker use another layer of indirection in the
> > cgroup hierarchy such that the limit is applied in the parent cgroup
> > to the container?
>
> I don't understand the question. And I can't answer docker questions.

This is explained at
https://www.lewisgaul.co.uk/blog/coding/rough/2022/05/20/cgroups-questions/#why-are-the-containers-cgroup-limits-not-set-on-a-parent-cgroup-under-dockerpodman.
I'm basically questioning why a cgroup limit applied by e.g. 'docker run
--memory=20000000' is set on the same cgroup that is delegated to the
container, such that the container is able to modify its own limit (if it
has write access). It feels like there's a missing cgroup layer in this
setup. If others agree with this assessment then I would be happy to
bring it up on the docker/podman issue trackers.
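
To illustrate the concern (a sketch assuming cgroups v2, a private cgroup
namespace and a writable cgroupfs):

  # Inside a container started with 'docker run --memory=20000000 ...',
  # the limit lands in the container's own (delegated) cgroup ...
  cat /sys/fs/cgroup/memory.max

  # ... so with write access the container could simply lift it:
  echo max > /sys/fs/cgroup/memory.max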

> > - What happens if you have two of the same cgroup mount?
>
> what do you mean by a "cgroup mount"? A cgroupfs controller mount? If
> they are within the same cgroup namespace they will be effectively bind
> mounts of each other, i.e. show the exact same contents.

Yes that's what I meant, and this confirms what I believed to be the case,
thanks.
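
For anyone following along, this is easy to demonstrate (assuming cgroups
v2 and sufficient privileges):

  mkdir /tmp/cg2
  mount -t cgroup2 cgroup2 /tmp/cg2
  # The two mounts now present the exact same hierarchy:
  diff <(ls /sys/fs/cgroup) <(ls /tmp/cg2)    # no output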

> >   - Are there any gotchas/concerns around manipulating cgroups via
> > multiple mount points?
>
> Why would you do that though?

I'm not sure, I'm just trying to better understand how cgroups work and
what's going on when creating/manipulating cgroup mounts.

> > - What’s the correct way to check which controllers are enabled?
>
> enabled *in* *what*? in the kernel? /proc/cgroups. Mounted? "mount"
> maybe? in your container mgr? depends on that.
>
> >   - What is it that determines which controllers are enabled? Is it
> > kernel configuration applied at boot?
>
> Enabled where?

I meant in the kernel, i.e. which controllers it's possible to create
mounts for and use.
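
For reference, the corresponding checks as I understand them (assuming
the v2 hierarchy, where present, is mounted at /sys/fs/cgroup):

  # v1: controllers known to the kernel (and whether enabled):
  cat /proc/cgroups

  # v2: controllers available in the root of the unified hierarchy:
  cat /sys/fs/cgroup/cgroup.controllers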

> >   - Is it possible to have some controllers enabled for v1 at the
> > same time as others are enabled for v2?
>
> Yes.

Ah ok, that's interesting. So it's not always possible to say "the host's
active cgroup version is {1,2}" - it would have to be stated on a
per-controller basis, such as "the cgroup memory controller is enabled on
version {1,2}"? In practice, is this a case that's likely to be
encountered in the wild [on a host running systemd]?
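
(For what it's worth, a split like this can even be created deliberately,
since a given controller can only be bound to one hierarchy at a time -
e.g. booting with the kernel parameter below disables the memory
controller on v1, freeing it for use on the v2 hierarchy while other
controllers remain usable on v1.)

  cgroup_no_v1=memory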

Thanks,
Lewis
