[systemd-devel] Why does nspawn need two child processes?

Thu Jun 1 00:40:38 UTC 2017

Hi all,

I have a question about `systemd-nspawn` internals.

When creating the child process, it does something like:

      parent
        |
    clone(MOUNT)
        |  `------------,
        |          outer_child()
        |               |
        |           clone(rest)
        |               |  `------------,
        |            return        inner_child()
        |  ,-----------'                |
      wait()                            |
        |                             exec()
        |                              |||
        |                             exit()
        |  ,----------------------------'
      wait()

where in the first `clone()` it unshares the mount namespace, and in
the second `clone()` it unshares all of the other namespaces (except
for the cgroup namespace).

Initially, I was confused by the awkward dance of having two
children.  I couldn't see why a separate `outer_child` and
`inner_child` are necessary; why can't everything be done in a
single child process, like this?

      parent
        |
    clone(MOUNT)
        |  `------------,
        |            child()
        |               |
        |          unshare(rest)
        |               |
        |             exec()
        |              |||
        |             exit()
        |  ,------------'
      wait()

`systemd-nspawn` has used the current two-child approach since
user-namespace support was first completed in commit 03cfe0d5, whose
only commit message is the brief "nspawn: finish user namespace
support", so there aren't many clues to be found in the commit log.

Part of the answer lies in the behavior of `unshare(CLONE_NEWPID)`.
Unlike the other namespaces that may be unshared, calling
`unshare(CLONE_NEWPID)` doesn't move *this* process into a new PID
namespace; it only arranges for children created by later
`fork()`/`clone()` calls to be placed in a new one (the first such
child becomes PID 1 there).  So even if we changed `systemd-nspawn`
to the `clone(MOUNT)/unshare(rest)` model, it would still have to
`clone()` (or plain `fork()` at that point) a second, inner, child
process.
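
To make that behavior concrete, here is a minimal sketch in Python
that calls libc's `unshare()` through ctypes.  It is not nspawn code:
the names `probe_unshare_newpid` and `_outer` are made up for this
illustration, and the fallback to `CLONE_NEWUSER` is my own assumption
so that it has a chance of running without root.  The `unshare()` is
done in a throwaway helper process so the caller is not affected.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # values from <linux/sched.h>
CLONE_NEWPID = 0x20000000


def _outer(write_fd):
    """Body of the throwaway "outer" process; reports through write_fd."""
    libc = ctypes.CDLL(None, use_errno=True)
    pid_before = os.getpid()
    # Plain CLONE_NEWPID needs CAP_SYS_ADMIN.  As a fallback (assumption:
    # unprivileged user namespaces are enabled), also unshare the user
    # namespace, which grants that capability inside it.
    if (libc.unshare(CLONE_NEWPID) != 0
            and libc.unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0):
        return  # not permitted here; report nothing
    pid_after = os.getpid()  # unshare() did not move this process
    inner = os.fork()
    if inner == 0:
        # First child after unshare(): it lives in the new PID namespace.
        try:
            msg = b"%d %d %d" % (pid_before, pid_after, os.getpid())
            os.write(write_fd, msg)
        finally:
            os._exit(0)
    os.waitpid(inner, 0)


def probe_unshare_newpid():
    """Return the PIDs seen around unshare(CLONE_NEWPID), or None."""
    read_fd, write_fd = os.pipe()
    outer = os.fork()
    if outer == 0:
        try:
            os.close(read_fd)
            _outer(write_fd)
        finally:
            os._exit(0)  # never fall back into the caller's code
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        fields = pipe.read().split()  # empty if unshare() was refused
    os.waitpid(outer, 0)
    if len(fields) != 3:
        return None
    before, after, inner = (int(f) for f in fields)
    return {
        "outer_pid_before": before,
        "outer_pid_after": after,
        "inner_pid": inner,
    }
```

Calling `probe_unshare_newpid()` returns `None` where the kernel
refuses the `unshare()`.  Where it is permitted, I would expect
`outer_pid_before` and `outer_pid_after` to be equal and `inner_pid`
to be 1, which is the "second child is unavoidable" point above.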

So then, I'm left wondering why unsharing the PID namespace can't be
moved up to the initial `clone()`, allowing everything else to be
unshared with `unshare(2)` in the initial child process:

      parent
        |
    clone(MOUNT|PID)
        |  `------------,
        |            child()
        |               |
        |          unshare(rest)
        |               |
        |             exec()
        |              |||
        |             exit()
        |  ,------------'
      wait()

So my question becomes: what has to be done *after* unsharing the
mount namespace, but *before* unsharing the PID namespace?

-- 
Happy hacking,
~ Luke Shumaker