[systemd-devel] avoid unmounts in unprivileged containers

Wed Feb 24 17:19:53 UTC 2021

On Fr, 19.02.21 19:17, Rodny Molina (rodnymolina at gmail.com) wrote:

> Hi,
>
> As part of a prototype I'm working on to run systemd within an unprivileged
> docker container, I would like to prevent mountpoints created at runtime
> from being unmounted during the container shutdown process. I understand
> that systemd creates "<blah>.mount" units dynamically for
> these mountpoints as they show up in /proc/pid/mountinfo, but after reading
> the docs + code, I don't see a way to avoid these unmounts during the
> shutdown.target execution.

Yeah, it would be great if we could automatically determine "foreign
owned" mounts, and then step away from them. But there's really no way
for us to figure that out, at least to my knowledge. Ideally
/proc/self/mountinfo would tell us about this in some field, but it
really doesn't afaik.
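
To illustrate what the file does carry, here is a minimal Python sketch
that splits one mountinfo line into the fields documented in proc(5).
The names MountEntry and parse_mountinfo_line are made up for this
example and are not systemd code. Note that none of the fields records
who created or "owns" a mount.

```python
import re
from typing import List, NamedTuple


class MountEntry(NamedTuple):
    """One line of /proc/<pid>/mountinfo (field layout per proc(5))."""
    mount_id: int
    parent_id: int
    device: str                 # "major:minor"
    root: str
    mount_point: str
    mount_options: str
    optional_fields: List[str]  # e.g. "shared:N", "master:N"
    fs_type: str
    source: str
    super_options: str


def _unescape(field: str) -> str:
    # The kernel writes space, tab, newline and backslash as octal
    # escapes such as \040; turn them back into characters.
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mountinfo_line(line: str) -> MountEntry:
    parts = line.rstrip("\n").split(" ")
    # A lone "-" ends the variable-length list of optional fields,
    # which starts at index 6.
    sep = parts.index("-", 6)
    return MountEntry(
        mount_id=int(parts[0]),
        parent_id=int(parts[1]),
        device=parts[2],
        root=_unescape(parts[3]),
        mount_point=_unescape(parts[4]),
        mount_options=parts[5],
        optional_fields=parts[6:sep],
        fs_type=parts[sep + 1],
        source=_unescape(parts[sep + 2]),
        super_options=parts[sep + 3],
    )
```

The optional fields only describe propagation (shared/master/...), so
they say nothing about whether the container manager or the payload
established a given mount.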

> Interestingly, I see that there's code
> <https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398>
> that skips the unmounting cycle attending to the ConditionVirtualization /
> containerized settings, which is what I need, but I'm not able to see
> that code being called during the container shutdown -- probably I'm not
> understanding systemd's fsm unwinding logic well enough ...

There are two phases of shutdown. The first is the regular phase, where
we follow mount unit deps and stuff is unmounted via /sbin/umount,
i.e. where the shutdown is handled by the usual unit logic.

And then there's the second phase, which shutdown.c implements: it's a
separate binary that PID 1 invokes via execve() (so that it becomes
the new PID 1) and that then pretty robustly just tries to
umount/detach/disassemble/… whatever might be left over, without any
understanding of dependencies.

The first phase hence is the "clean" shutdown logic and the second
phase is the "dirty" fallback logic that tries really hard to sync/put
file systems into a clean state if the first phase fails (maybe
because of some misplaced deps).

The second phase is skipped in containers, the first one is not. The
second phase is unnecessary in containers since the container manager
and namespace cleanup take care of this anyway, and even if they didn't,
the host's shutdown logic can take responsibility for all this.

Now, if the kernel provided us with that info, we'd generate the
deps for .mount units synthesized from /proc/self/mountinfo in a way
that "foreign owned" mounts won't get unmounted in phase 1, but we
simply can't do that automatically since we can't distinguish
them. :-(

You could manually define .mount units for all mounts you know are
owned by the outside container manager, but that is nasty and
fragile. The mount units would have to carefully have the right deps
(or better: should lack the right deps) to ensure things are clean
when shutting down.
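
One practical detail of that manual route: a .mount unit file has to be
named after the mount point it controls, using systemd's path escaping.
The real tool for this is "systemd-escape --path --suffix=mount". The
Python sketch below is only a simplified approximation of that escaping
(the helper name mount_unit_name is made up here, and ".." components
are not resolved), to show how a mount point maps to a unit name.

```python
import string

# Bytes that stay literal; everything else becomes \xXX.
# (Assumption: this mirrors the documented escaping rules, it is not
# copied from systemd's source.)
_SAFE = set((string.ascii_letters + string.digits + ":_.").encode("ascii"))


def mount_unit_name(path: str) -> str:
    """Approximate the .mount unit name for an absolute mount point."""
    # Drop empty and "." components, which also strips leading,
    # trailing and duplicate slashes.
    components = [c for c in path.split("/") if c and c != "."]
    if not components:
        return "-.mount"        # the root file system
    raw = "/".join(components).encode("utf-8")
    out = []
    for i, byte in enumerate(raw):
        if byte == ord("/"):
            out.append("-")     # path separators become dashes
        elif byte in _SAFE and not (i == 0 and byte == ord(".")):
            out.append(chr(byte))
        else:
            # Literal "-", "\", a leading "." and any other byte
            # are hex-escaped.
            out.append("\\x%02x" % byte)
    return "".join(out) + ".mount"
```

For example, /var/lib/docker maps to var-lib-docker.mount, while a
dash inside a path component is escaped, so /mnt/my-data maps to
mnt-my\x2ddata.mount. Which dependencies such a unit should then carry
or omit is exactly the fragile part described above.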

So yeah, I'd love to fix this properly, generically, but this requires
some kernel work first, and that's not just a technical difficulty but,
given the maintainer of said interfaces, also a political one.

Lennart

--
Lennart Poettering, Berlin