[systemd-devel] Questions around cgroups, systemd, containers

Sat May 21 07:48:34 UTC 2022

On Fr, 20.05.22 17:12, Lewis Gaul (lewis.gaul at gmail.com) wrote:

> To summarize the questions (taken from the second post linked above):
> - Why are private cgroups mounted read-only in non-privileged
> containers?

"private cgroups"? What do you mean by that? The controllers?

Controller delegation on cgroupsv1 is simply not safe, that's all. You
can provide invalid configuration to the kernel, and DoS the machine
through it. cgroups are simply not a suitable privilege boundary on
cgroupsv1.

If you want safe delegation, use cgroupsv2, where delegation is safe.

> - Is it sound to override Docker’s mounting of the private container
> cgroups under v1?

I don't know what Docker does these days, but they used to be entirely
ignorant of safe cooperation in the cgroup tree, i.e. they
ignored https://systemd.io/CGROUP_DELEGATION in its entirety, as they
never really accepted systemd's existence.

Today I think most distros have switched over to other ways to run
containers, i.e. podman and so on, which take a more professional
approach to all this, and can safely cooperate in a cgroup tree.

>   - What are the concerns around the approach of passing '-v
> /sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its
> cgroups?

I don't know what this does. Is this a Docker thing?

>   - Is modifying/replacing the cgroup mounts set up by the container engine
> a reasonable workaround, or could this be fragile?

I am not sure I follow? A workaround for what? One shouldn't assume
one even has the privs to modify cgroup mounts.

But why would one even?

> - When is it valid to manually manipulate container cgroups?

Once you have asked for your own delegated subtree; see the docs:

https://systemd.io/CGROUP_DELEGATION
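
As a rough illustration of "asking for a delegated subtree", one option
is to start the payload in a transient scope unit with Delegate=yes set.
The sketch below only builds the systemd-run command line; the helper
name and the payload command are hypothetical, and whether a scope or a
service unit is the better fit depends on the container manager.

```python
from typing import List


def delegated_scope_argv(payload: List[str], user: bool = False) -> List[str]:
    """Build a systemd-run command line that requests cgroup delegation.

    The payload runs in a transient scope unit with Delegate=yes, so the
    cgroup subtree below that unit is the payload's to manage.
    """
    argv = ["systemd-run"]
    if user:
        # Talk to the per-user service manager instead of the system one.
        argv.append("--user")
    argv += ["--scope", "--property=Delegate=yes", "--"]
    return argv + list(payload)


if __name__ == "__main__":
    # Hypothetical payload, shown only to illustrate the resulting argv.
    print(" ".join(delegated_scope_argv(["sleep", "60"])))
```

The resulting list can be handed to subprocess.run() by whatever starts
the container payload.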

>   - Do container managers such as Docker and Podman correctly delegate
> cgroups on hosts running Systemd?

Podman probably does this correctly. Docker didn't; I am not sure
whether that has changed.

>   - Are these container managers happy for the container to take ownership
> of the container’s cgroup?

I am not sure I grok this question, but a correctly implemented
container manager should be able to safely run cgroups-using payloads
inside the container. In that model, a host systemd manages the root
of the tree, the container manager a cgroup further down, and the
payload of the container (for example another systemd run inside the
container) the stuff below.

> - Why are the container’s cgroup limits not set on a parent cgroup under
> Docker/Podman?

I don't grok the question?

>   - Why doesn’t Docker use another layer of indirection in the cgroup
> hierarchy such that the limit is applied in the parent cgroup to the
> container?

I don't understand the question. And I can't answer docker questions.

> - What happens if you have two of the same cgroup mount?

What do you mean by a "cgroup mount"? A cgroupfs controller mount? If
they are within the same cgroup namespace they will be effectively
bind mounts of each other, i.e. show the exact same contents.
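
One way to see whether two mount points are really the same hierarchy
is to group the cgroup entries in /proc/self/mountinfo. This is a
minimal sketch, assuming that mounts of the same hierarchy report the
same major:minor device number there; the function name and the sample
data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def cgroup_mounts_by_device(mountinfo_text: str) -> Dict[str, List[str]]:
    """Group cgroup/cgroup2 mount points by their major:minor device."""
    groups = defaultdict(list)
    for line in mountinfo_text.splitlines():
        # Fields before " - " are per-mount, fields after it describe the fs.
        before, sep, after = line.partition(" - ")
        if not sep:
            continue
        pre, post = before.split(), after.split()
        if len(pre) < 6 or not post:
            continue
        if post[0] in ("cgroup", "cgroup2"):
            # pre[2] is major:minor, pre[4] is the mount point.
            groups[pre[2]].append(pre[4])
    return dict(groups)


# Made-up sample: the memory hierarchy is mounted in two places.
SAMPLE_MOUNTINFO = """\
30 23 0:26 / /sys/fs/cgroup/memory rw,nosuid shared:9 - cgroup cgroup rw,memory
31 23 0:27 / /sys/fs/cgroup/cpu rw,nosuid shared:10 - cgroup cgroup rw,cpu
95 60 0:26 / /mnt/memcg rw,relatime - cgroup cgroup rw,memory
40 23 0:22 / /proc rw - proc proc rw
"""

if __name__ == "__main__":
    for device, points in cgroup_mounts_by_device(SAMPLE_MOUNTINFO).items():
        print(device, points)
```

On a live system the input would be the contents of
/proc/self/mountinfo rather than the sample string.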

>   - Are there any gotchas/concerns around manipulating cgroups via multiple
> mount points?

Why would you do that though?

> - What’s the correct way to check which controllers are enabled?

Enabled *in* *what*? In the kernel? /proc/cgroups. Mounted? "mount"
maybe? In your container mgr? That depends on the manager.
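
For the kernel-level view, /proc/cgroups is a small four-column table
(subsys_name, hierarchy, num_cgroups, enabled) that is easy to parse.
A minimal sketch follows; the type and function names and the sample
text are illustrative only.

```python
from typing import Dict, NamedTuple


class ControllerInfo(NamedTuple):
    hierarchy: int    # v1 hierarchy ID, 0 if not attached to a v1 hierarchy
    num_cgroups: int  # number of cgroups using this controller
    enabled: bool     # whether the controller is enabled in the kernel


def parse_proc_cgroups(text: str) -> Dict[str, ControllerInfo]:
    """Parse the contents of /proc/cgroups into a dict keyed by controller."""
    controllers = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the header line and blank lines
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore anything that does not look like a table row
        name, hierarchy, num_cgroups, enabled = fields
        controllers[name] = ControllerInfo(
            int(hierarchy), int(num_cgroups), enabled == "1"
        )
    return controllers


# Made-up sample in the /proc/cgroups layout.
SAMPLE_PROC_CGROUPS = """\
#subsys_name\thierarchy\tnum_cgroups\tenabled
cpuset\t0\t1\t1
cpu\t3\t42\t1
memory\t0\t97\t1
hugetlb\t0\t1\t0
"""

if __name__ == "__main__":
    for name, info in parse_proc_cgroups(SAMPLE_PROC_CGROUPS).items():
        print(name, info)
```

On a live system, pass in the result of open("/proc/cgroups").read().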

>   - What is it that determines which controllers are enabled? Is it kernel
> configuration applied at boot?

Enabled where?

>   - Is it possible to have some controllers enabled for v1 at the same time
> as others are enabled for v2?

Yes.
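
To see how controllers are split on such a mixed setup, one can combine
two sources: a non-zero hierarchy ID in /proc/cgroups means the
controller is attached to a v1 hierarchy, and the cgroup.controllers
file at the root of the cgroup2 mount lists what is available on v2.
A minimal sketch, with a made-up function name and sample data, and
assuming the caller already knows where cgroup2 is mounted:

```python
from typing import Set, Tuple


def split_controllers(
    proc_cgroups_text: str, v2_controllers_text: str
) -> Tuple[Set[str], Set[str]]:
    """Return (controllers on v1 hierarchies, controllers available on v2)."""
    # cgroup.controllers is a single space-separated line of names.
    v2 = set(v2_controllers_text.split())
    v1 = set()
    for line in proc_cgroups_text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0].startswith("#"):
            continue
        name, hierarchy, _num_cgroups, enabled = fields
        # A non-zero hierarchy ID means the controller sits on a v1 mount.
        if enabled == "1" and hierarchy != "0":
            v1.add(name)
    return v1, v2


# Made-up sample: cpu lives on v1, memory and pids on v2.
SAMPLE_TABLE = """\
#subsys_name\thierarchy\tnum_cgroups\tenabled
cpu\t3\t42\t1
memory\t0\t97\t1
pids\t0\t80\t1
"""
SAMPLE_V2_CONTROLLERS = "memory pids\n"

if __name__ == "__main__":
    print(split_controllers(SAMPLE_TABLE, SAMPLE_V2_CONTROLLERS))
```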

Lennart

--
Lennart Poettering, Berlin