<div dir="ltr">Thanks for your detailed answer and explanation, Lennart; it's fully consistent with my code-browsing findings.<div><br></div><div>I've been struggling myself with the problem you alluded to above: identifying "foreign" mountpoints. After banging my head against the wall for a while, I ended up implementing a heuristic based on the major:minor field of the /proc/pid/mountinfo file: if the container mountpoint being considered has a major:minor ID that matches one of the major:minor IDs present in the host mount namespace, then it is likely a "foreign" mountpoint and shouldn't be unmounted.</div><div><br></div><div>Obviously, this would force you to extend the current systemd mountinfo parser. There is also a caveat: not all file systems use a unique ID for every new mountpoint (e.g. the "/dev/null" fs always uses the same major:minor ID across different mount namespaces), so there could be false positives, but that isn't a problem in our case. 
Here is the specific code if you want to check it out: <a href="https://github.com/nestybox/sysbox-fs/blob/master/mount/infoParser.go#L828">https://github.com/nestybox/sysbox-fs/blob/master/mount/infoParser.go#L828</a></div><div><br></div><div>Please let me know if you ever find a better approach.</div><div><br></div><div>cheers,</div><div><br></div><div>/Rodny</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Feb 24, 2021 at 9:19 AM Lennart Poettering <<a href="mailto:lennart@poettering.net">lennart@poettering.net</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, 19.02.21 19:17, Rodny Molina (<a href="mailto:rodnymolina@gmail.com" target="_blank">rodnymolina@gmail.com</a>) wrote:<br> <br> > Hi,<br> ><br> > As part of a prototype I'm working on to run systemd within an unprivileged<br> > docker container, I would like to prevent mountpoints created at runtime<br> > from being unmounted during the container shutdown process. I understand<br> > that systemd creates "&lt;blah&gt;.mount" units dynamically for<br> > these mountpoints as they show up in /proc/pid/mountinfo, but after reading<br> > the docs + code, I don't see a way to avoid these unmounts during the<br> > shutdown.target execution.<br> <br> Yeah, it would be great if we could automatically determine "foreign<br> owned" mounts, and then step away from them. But there's really no way<br> for us to figure that out, at least to my knowledge. 
Ideally<br> /proc/self/mountinfo would tell us about this in some field, but it<br> really doesn't afaik.<br> <br> > Interestingly, I see that there's code<br> > <<a href="https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398" rel="noreferrer" target="_blank">https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398</a>><br> > that<br> > skips the unmounting cycle based on the ConditionVirtualization /<br> > containerized settings, which is what I need, but I'm not able to see<br> > that code being called during the container shutdown -- probably I'm not<br> > understanding systemd's FSM unwinding logic well enough ...<br> <br> There are two phases of shutdown: the regular phase where we follow<br> mount unit deps, and stuff is unmounted via /sbin/umount, i.e. where<br> the shutdown is handled by the usual unit logic.<br> <br> And then there's the second phase which shutdown.c implements: it's a<br> separate binary that PID 1 invokes via execve() (so that it becomes<br> the new PID 1) and then pretty robustly just tries to<br> umount/detach/disassemble/… whatever might be left over, without any<br> understanding of dependencies.<br> <br> The first phase hence is the "clean" shutdown logic and the second<br> phase is the "dirty" fallback logic that tries really hard to sync/put<br> file systems into a clean state if the first phase fails (maybe<br> because of some misplaced deps).<br> <br> The second phase is skipped in containers, the first one is not. 
The<br> second phase is unnecessary in containers since the container manager<br> and namespace cleanup take care of this anyway, and even if they didn't,<br> the host's shutdown logic can take responsibility for all this.<br> <br> Now, if the kernel provided us with the info, we'd generate the<br> deps for .mount units synthesized from /proc/self/mountinfo in a way<br> that "foreign owned" mounts won't get unmounted in phase 1, but we<br> simply can't do that automatically since we can't distinguish<br> them. :-(<br> <br> You could manually define .mount units for all mounts you know are<br> owned by the outside container manager, but that is nasty and<br> fragile. The mount units would have to carefully have the right deps<br> (or better: should lack the right deps) to ensure things are clean<br> when shutting down.<br> <br> So yeah, I'd love to fix this properly, generically, but this requires<br> some kernel work first, and that's not just a technical difficulty but<br> given the maintainer of said interfaces also a political one.<br> <br> Lennart<br> <br> --<br> Lennart Poettering, Berlin<br> </blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature">/Rodny</div>
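<div><br></div><div>P.S. For readers who don't want to dig through the linked Go code, here is a minimal sketch of the major:minor heuristic described above. It is written in Python purely for illustration and is not the sysbox-fs implementation. The names MountEntry, parse_mountinfo and likely_foreign_mounts are hypothetical, the sample mountinfo lines are made up, and it assumes you can obtain the host's mountinfo text from somewhere (which an unprivileged container normally cannot do by itself).</div>

```python
from typing import List, NamedTuple


class MountEntry(NamedTuple):
    """Subset of one /proc/<pid>/mountinfo line (hypothetical helper type)."""
    mount_id: int
    dev: str          # the "major:minor" field, kept as a string
    mount_point: str
    fs_type: str


def parse_mountinfo(text: str) -> List[MountEntry]:
    """Parse mountinfo text; lines that do not look valid are skipped."""
    entries = []
    for line in text.splitlines():
        # Optional fields end at a lone "-"; the fs type follows it.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        fields = left.split()
        tail = right.split()
        if len(fields) < 6 or not tail:
            continue
        # fields: mount ID, parent ID, major:minor, root, mount point, options
        entries.append(MountEntry(int(fields[0]), fields[2], fields[4], tail[0]))
    return entries


def likely_foreign_mounts(host_text: str, container_text: str) -> List[str]:
    """Return container mount points whose major:minor also appears on the host.

    This is only a heuristic: file systems that reuse one major:minor ID
    across mount namespaces will show up as false positives.
    """
    host_devs = {entry.dev for entry in parse_mountinfo(host_text)}
    return [
        entry.mount_point
        for entry in parse_mountinfo(container_text)
        if entry.dev in host_devs
    ]


# Made-up sample data, for illustration only.
HOST_SAMPLE = """\
25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
26 25 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw
"""

CONTAINER_SAMPLE = """\
500 400 0:50 / / rw,relatime - overlay overlay rw
501 500 8:1 /var/lib/docker/containers/abc/hostname /etc/hostname rw,relatime - ext4 /dev/sda1 rw
502 500 0:60 / /run rw,nosuid - tmpfs tmpfs rw
"""

if __name__ == "__main__":
    print(likely_foreign_mounts(HOST_SAMPLE, CONTAINER_SAMPLE))
```

<div>In this sample only the bind mount at /etc/hostname shares a major:minor ID (8:1) with the host, so it is the only mountpoint flagged as likely foreign. The overlay root and the tmpfs at /run have IDs the host never saw and would be treated as owned by the container.</div>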