[systemd-devel] Why does nspawn need two child processes?
Luke Shumaker
lukeshu at lukeshu.com
Thu Jun 1 00:40:38 UTC 2017
Hi all,
I have a question about `systemd-nspawn` internals.
When creating the child process, it does something like:
parent
  |
clone(MOUNT)
  |  `------------,
  |         outer_child()
  |               |
  |          clone(rest)
  |               |  `------------,
  |             return      inner_child()
  |   ,-----------'               |
wait()                            |
  |                            exec()
  |                              |||
  |                            exit()
  |  ,----------------------------'
wait()
where in the first `clone()` it unshares the mount namespace, and in
the second `clone()` it unshares all of the other namespaces (except
for the cgroup namespace).
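To make the shape of that diagram concrete, here is a rough C sketch of the two-stage pattern. It is not nspawn's actual code: `spawn_in_namespaces()` and `run_two_stage()` are names I made up for illustration, "rest" is reduced to an arbitrary subset (IPC, PID, UTS), and for brevity the outer child waits for the inner child itself instead of returning early as in the diagram. It needs CAP_SYS_ADMIN; without it the first `clone()` fails and the function just reports -1.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)

/* Hypothetical helper: clone() a child that runs fn() in new namespaces.
 * Returns the child's PID, or -1 on failure (e.g. EPERM when unprivileged). */
static pid_t spawn_in_namespaces(int (*fn)(void *), int ns_flags) {
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;
    /* The stack grows downwards on common architectures, so pass its top. */
    pid_t pid = clone(fn, stack + STACK_SIZE, ns_flags | SIGCHLD, NULL);
    /* Without CLONE_VM the child has its own copy of this memory,
     * so the parent may release its copy right away. */
    free(stack);
    return pid;
}

/* Stand-in for inner_child(): a real container would exec() its init here. */
static int inner_child(void *arg) {
    (void) arg;
    return 0;
}

/* Stand-in for outer_child(): already in a new mount namespace, it creates
 * the inner child in a subset of the remaining namespaces (assumption:
 * IPC, PID and UTS only). */
static int outer_child(void *arg) {
    (void) arg;
    pid_t inner = spawn_in_namespaces(inner_child,
                                      CLONE_NEWIPC | CLONE_NEWPID | CLONE_NEWUTS);
    if (inner < 0)
        return 2;
    int status;
    if (waitpid(inner, &status, 0) < 0)
        return 2;
    return WIFEXITED(status) ? WEXITSTATUS(status) : 2;
}

/* Returns 0 if both stages ran and exited cleanly, -1 otherwise. */
int run_two_stage(void) {
    pid_t outer = spawn_in_namespaces(outer_child, CLONE_NEWNS);
    if (outer < 0)
        return -1;
    int status;
    if (waitpid(outer, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

Calling `run_two_stage()` as root should return 0; as an unprivileged user it should return -1 because `clone(CLONE_NEWNS)` is refused.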
Initially, I was confused by the awkward dance with two children; I
couldn't imagine why it is necessary to have a separate `inner_child`
and `outer_child`. Why can't everything be done in a single child
process, like so:
parent
  |
clone(MOUNT)
  |  `------------,
  |            child()
  |               |
  |         unshare(rest)
  |               |
  |             exec()
  |              |||
  |             exit()
  |  ,------------'
wait()
nspawn has used the current two-child approach since user-namespace
support was first completed in commit 03cfe0d5, whose entire commit
message is the brief "nspawn: finish user namespace support"; so there
aren't many clues to be found in the commit log.
Part of the answer lies in the behavior of `unshare(CLONE_NEWPID)`.
Unlike all of the other namespaces that may be unshared, calling
`unshare(CLONE_NEWPID)` doesn't actually unshare the PID namespace in
*this* process; instead, it arranges for the PID namespace to be
unshared at the next `fork()`/`clone()` call. So even if we changed
`systemd-nspawn` to
the `clone(MOUNT)/unshare(rest)` model, it would still have to
`clone()` (or plain `fork()` at that point) a second, inner, child
process.
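This `unshare(CLONE_NEWPID)` behavior can be checked with a small C sketch. The function name `demo_unshare_newpid()` is made up for illustration. It runs the experiment in a throwaway helper process so the caller's own namespace state is left untouched, and it needs CAP_SYS_ADMIN; if `unshare()` is refused it reports -1 rather than a result.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the process calling unshare(CLONE_NEWPID) keeps its PID
 * while the first child it forks afterwards sees itself as PID 1,
 * 0 if that is not what happened, and -1 if the experiment could not
 * run (for example EPERM without CAP_SYS_ADMIN). */
int demo_unshare_newpid(void) {
    pid_t helper = fork();
    if (helper < 0)
        return -1;

    if (helper == 0) {
        pid_t before = getpid();

        if (unshare(CLONE_NEWPID) < 0)
            _exit(2);                     /* not permitted or unsupported */

        /* The caller itself is not moved: its PID is unchanged. */
        if (getpid() != before)
            _exit(1);

        /* Only the next child lands in the new PID namespace. */
        pid_t inner = fork();
        if (inner < 0)
            _exit(2);
        if (inner == 0)
            _exit(getpid() == 1 ? 0 : 1); /* expect to be PID 1 in there */

        int st;
        if (waitpid(inner, &st, 0) < 0)
            _exit(2);
        _exit(WIFEXITED(st) ? WEXITSTATUS(st) : 2);
    }

    int status;
    if (waitpid(helper, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;
    case 1:
        return 0;
    default:
        return -1;
    }
}
```

Run as root, I would expect this to return 1: the helper's `getpid()` is the same before and after `unshare()`, and only the process it forks afterwards becomes PID 1 of the new namespace.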
So then, I'm left wondering why unsharing the PID namespace can't be
moved up to the initial `clone()`, allowing everything else to be
`unshare`(2)ed in the initial child process:
parent
  |
clone(MOUNT|PID)
  |  `------------,
  |            child()
  |               |
  |         unshare(rest)
  |               |
  |             exec()
  |              |||
  |             exit()
  |  ,------------'
wait()
So my question becomes: what has to be done *after* unsharing the
mount namespace, but *before* unsharing the PID namespace?
--
Happy hacking,
~ Luke Shumaker