[systemd-devel] avoid unmounts in unprivileged containers
Rodny Molina
rodnymolina at gmail.com
Sat Feb 27 19:28:47 UTC 2021
Thanks for your detailed answer / explanation, Lennart; it's fully
consistent with my code-browsing findings.
I've been struggling with the problem you alluded to above: identifying
"foreign" mountpoints. After banging my head against the wall for
a while, I ended up implementing a heuristic based on the
major:minor-number field of the /proc/pid/mountinfo file: if the container
mountpoint being considered has a major:minor ID that matches one of the
major:minor IDs present in the host mount namespace, then it is
likely a "foreign" mountpoint and shouldn't be unmounted.
Obviously, this would force you to extend the current systemd mountinfo
parser. And there is a caveat: not all file systems use a unique
/ differentiated ID for every new mountpoint (e.g. the "/dev/null" fs always
uses the same major:minor ID across different mount namespaces), so there
could be false positives, but that doesn't represent a problem in our case.
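
As a rough illustration of the heuristic and its caveat (this is not the
sysbox-fs code linked below, just a minimal sketch): the snippet assumes the
caller can obtain mountinfo text for both the host and the container mount
namespaces, and the names parse_mountinfo, likely_foreign, HOST and CONTAINER,
as well as the sample mount entries, are made up for the example.

```python
from typing import Iterable, List, NamedTuple


class Mount(NamedTuple):
    mount_point: str
    dev: str  # "major:minor" exactly as printed in mountinfo


def parse_mountinfo(text: str) -> List[Mount]:
    """Extract (mount point, major:minor) pairs from mountinfo-formatted text."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        # Leading fields: mount ID, parent ID, major:minor, root, mount point
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mounts.append(Mount(mount_point=fields[4], dev=fields[2]))
    return mounts


def likely_foreign(container: Iterable[Mount], host: Iterable[Mount]) -> List[str]:
    """Return container mount points whose major:minor also appears on the host."""
    host_devs = {m.dev for m in host}
    return [m.mount_point for m in container if m.dev in host_devs]


# Hypothetical sample data, not taken from a real system.
HOST = """\
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
23 22 0:6 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw
"""
CONTAINER = """\
500 400 0:52 / / rw,relatime - overlay overlay rw
501 500 8:1 /var/lib/data /mnt/data rw,relatime - ext4 /dev/sda1 rw
502 500 0:6 /null /dev/null rw,nosuid - devtmpfs devtmpfs rw
503 500 0:60 / /run rw,nosuid - tmpfs tmpfs rw
"""

if __name__ == "__main__":
    foreign = likely_foreign(parse_mountinfo(CONTAINER), parse_mountinfo(HOST))
    print(foreign)
```

In this sample, /mnt/data is flagged because it shares 8:1 with the host, and
/dev/null is flagged as well because its ID is the same in every namespace,
which is the kind of match the caveat above is about.
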
Here is the specific code if you want to check it out:
https://github.com/nestybox/sysbox-fs/blob/master/mount/infoParser.go#L828
Please let me know if you ever find a better approach.
cheers,
/Rodny
On Wed, Feb 24, 2021 at 9:19 AM Lennart Poettering <lennart at poettering.net>
wrote:
> On Fr, 19.02.21 19:17, Rodny Molina (rodnymolina at gmail.com) wrote:
>
> > Hi,
> >
> > As part of a prototype I'm working on to run systemd within an
> > unprivileged docker container, I would like to prevent mountpoints
> > created at runtime from being unmounted during the container shutdown
> > process. I understand that systemd creates "<blah>.mount" units
> > dynamically for these mountpoints as they show up in
> > /proc/pid/mountinfo, but after reading the docs + code, I don't see a
> > way to avoid these unmounts during the shutdown.target execution.
>
> Yeah, it would be great if we could automatically determine "foreign
> owned" mounts, and then step away from them. But there's really no way
> for us to figure that out, at least to my knowledge. Ideally
> /proc/self/mountinfo would tell us about this in some field, but it
> really doesn't afaik.
>
> > Interestingly, I see that there's code
> > <https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398>
> > that skips the unmounting cycle depending on the ConditionVirtualization /
> > containerized settings, which is what I need, but I'm not able to see
> > that code being called during the container shutdown -- probably I'm not
> > understanding systemd's fsm unwinding logic well enough ...
>
> There are two phases of shutdown: the regular phase where we follow
> mount unit deps, and stuff is umounted via /sbin/umount. i.e. where
> the shutdown is handled by the usual unit logic.
>
> And then there's the second phase which shutdown.c implements: it's a
> separate binary that PID 1 invokes via execve() (so that it becomes
> new PID 1) and then pretty robustly just tries to
> umount/detach/disassemble/… whatever might be left over, without any
> understanding of dependencies.
>
> The first phase hence is the "clean" shutdown logic and the second
> phase is the "dirty" fallback logic that tries really hard to sync/put
> file systems into a clean state if the first phase fails (maybe
> because of some misplaced deps).
>
> The second phase is skipped in containers, the first one is not. The
> second phase is unnecessary in containers since the container manager
> and namespace cleanup take care of this anyway, and even if they didn't,
> the host's shutdown logic can take responsibility for all this.
>
> Now, if the kernel would provide us with the info we'd generate the
> deps for .mount units synthesized from /proc/self/mountinfo in a way
> that "foreign owned" mounts won't get unmounted in phase 1, but we
> simply can't do that automatically since we can't distinguish
> them. :-(
>
> You could manually define .mount units for all mounts you know are
> owned by the outside container manager, but that is nasty and
> fragile. The mount units would have to carefully have the right deps
> (or better: should miss the right deps) to ensure things are clean
> when shutting down.
>
> So yeah, I'd love to fix this properly, generically, but this requires
> some kernel work first, and that's not just a technical difficulty but
> given the maintainer of said interfaces also a political one.
>
> Lennart
>
> --
> Lennart Poettering, Berlin
>
--
/Rodny