[systemd-devel] Why does nspawn need two child processes?
lukeshu at lukeshu.com
Thu Jun 1 00:40:38 UTC 2017
I have a question about `systemd-nspawn` internals.
When creating the child process, it does something like:
    [ASCII diagram garbled in the archive; the surviving fragment shows
    a branch that ends in `return inner_child()`]
where in the first `clone()` it unshares the mount namespace, and in
the second `clone()` it unshares all of the other namespaces (except
for the cgroup namespace).
Initially, I was confused by the awkward dance with having two
children; I couldn't imagine a reason why it is necessary to do this
with a separate `inner_child` and `outer_child`; why can't everything
be done in a single child process?
`systemd-nspawn` has used the current two-child approach since
user-namespace support was first completed in 03cfe0d5, which has only
the brief commit message "nspawn: finish user namespace support", so
the commit log doesn't offer many clues.
Part of the answer lies in the behavior of `unshare(CLONE_NEWPID)`.
Unlike all of the other namespaces that may be unshared, calling
`unshare(CLONE_NEWPID)` doesn't actually unshare the PID namespace in
*this* process; it says to unshare the PID namespace at the next
`fork()`/`clone()` call. So even if we changed `systemd-nspawn` to
the `clone(MOUNT)/unshare(rest)` model, it would still have to
`clone()` (or plain `fork()` at that point) a second, inner, child
process.
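
To make the deferred effect of `unshare(CLONE_NEWPID)` concrete, here
is a small standalone sketch; it is not code from nspawn. The names
`probe_pidns`, `struct pidns_probe`, and `read_full` are made up for
illustration. It assumes a Linux kernel that permits either an
unprivileged `CLONE_NEWUSER | CLONE_NEWPID` unshare or a privileged
`CLONE_NEWPID` one, and reports "unsupported" otherwise. It calls the
`unshare` system call through `syscall(2)` only so that the example
doesn't depend on feature-test macros; the glibc `unshare()` wrapper
would do the same thing.

```c
#include <errno.h>
#include <linux/sched.h>  /* CLONE_NEWPID, CLONE_NEWUSER */
#include <sys/syscall.h>  /* SYS_unshare */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result record for this sketch; not an nspawn type. */
struct pidns_probe {
    int   supported;     /* 1 if the kernel allowed the unshare, else 0 */
    pid_t caller_before; /* helper's getpid() before unshare */
    pid_t caller_after;  /* helper's getpid() after unshare */
    pid_t child_self;    /* getpid() in the first child forked afterwards */
};

/* Read exactly len bytes; return -1 on error or early EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Run the experiment in a throwaway helper so the caller's own
 * namespaces are left untouched. Returns 0 on success, -1 on a
 * pipe()/fork() failure in the caller itself. */
int probe_pidns(struct pidns_probe *out)
{
    int fds[2];
    pid_t vals[3];

    out->supported = 0;
    out->caller_before = out->caller_after = out->child_self = -1;
    if (pipe(fds) < 0)
        return -1;

    pid_t helper = fork();
    if (helper < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (helper == 0) {
        close(fds[0]);
        pid_t pair[2];
        pair[0] = getpid();
        /* Try the unprivileged route first, then the privileged one. */
        if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWPID) < 0 &&
            syscall(SYS_unshare, CLONE_NEWPID) < 0)
            _exit(2);
        pair[1] = getpid(); /* the helper itself has not moved */
        if (write(fds[1], pair, sizeof pair) != (ssize_t)sizeof pair)
            _exit(3);

        pid_t inner = fork(); /* only this child enters the new namespace */
        if (inner < 0)
            _exit(3);
        if (inner == 0) {
            pid_t self = getpid();
            ssize_t n = write(fds[1], &self, sizeof self);
            _exit(n == (ssize_t)sizeof self ? 0 : 3);
        }
        while (waitpid(inner, NULL, 0) < 0 && errno == EINTR) { }
        _exit(0);
    }

    close(fds[1]);
    int ok = read_full(fds[0], vals, sizeof vals) == 0;
    close(fds[0]);
    while (waitpid(helper, NULL, 0) < 0 && errno == EINTR) { }
    if (ok) {
        out->supported = 1;
        out->caller_before = vals[0];
        out->caller_after = vals[1];
        out->child_self = vals[2];
    }
    return 0;
}
```

Where the kernel allows the unshare, `caller_before` and
`caller_after` come back equal, while `child_self` is 1: the process
that called `unshare` keeps its place in the old PID namespace, and
only the child it forks afterwards becomes the init of the new one.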
So then, I'm left wondering why unsharing the PID namespace can't be
moved up to the initial `clone()`, allowing everything else to be
`unshare`(2)ed in the initial child process.
So my question becomes: what has to be done *after* unsharing the
mount namespace, but *before* unsharing the PID namespace?
~ Luke Shumaker