[systemd-devel] What makes systemd-nspawn "not suitable for secure container setups"?

Sun Apr 24 13:54:42 PDT 2011

On Fri, 22.04.11 19:55, Josh Triplett (josh at joshtriplett.org) wrote:

> The systemd-nspawn manpage lists the various mechanisms used to isolate
> the container, and then says "Note that even though these security
> precautions are taken systemd-nspawn is not suitable for secure
> container setups. Many of the security features may be circumvented and
> are hence primarily useful to avoid accidental changes to the host
> system from the container."
> 
> How can a process in a systemd-nspawn container circumvent the container
> setup?  What additional steps would systemd-nspawn need to take to
> provide a secure container setup?

Well, the question is of course what "secure" actually means...

But here's why I put this sentence in the man page:

First of all, we don't virtualize AF_UNIX abstract namespace sockets. That
is part of the network virtualization, and I explicitly decided not to
virtualize the network, to keep things simple: otherwise containers would
need specific network configuration and would be much harder to use than
chroots, and the ease of use of chroot is what I was aiming for.

Ideally AF_UNIX virtualization would not be part of CLONE_NEWNET but of
CLONE_NEWIPC, since it is a local IPC interface, and has nothing to do
with the network, but I guess that's too late now.

Fortunately not many services use abstract namespace sockets, since they
are insecure and mostly unnecessary these days. There are
a few exceptions though: some services use randomly named unix
sockets. And there's udev. Since we don't want to run a second udev in
the container we actually benefit from this here: only the host udev can
bind the socket, hence the container udev will immediately fail.
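
To illustrate the udev point: an abstract namespace name can be bound only
once per network namespace, so a second binder that shares the namespace
fails right away. A minimal Python sketch, assuming Linux; the socket name
and the helper names are made up for illustration and are not what udev
actually uses:

```python
import errno
import os
import socket


def bind_abstract(name: bytes) -> socket.socket:
    """Bind a datagram socket to an abstract namespace name (leading NUL byte)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.bind(b"\0" + name)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_errno(name: bytes):
    """Bind the same name twice; return the errno of the second attempt, or None."""
    first = bind_abstract(name)  # plays the role of the host service
    try:
        try:
            second = bind_abstract(name)  # plays the role of the container
        except OSError as exc:
            return exc.errno
        second.close()
        return None
    finally:
        first.close()


if __name__ == "__main__":
    # Hypothetical name, made unique per process to avoid clashes.
    name = b"example-nspawn-demo-%d" % os.getpid()
    print(second_bind_errno(name) == errno.EADDRINUSE)
```

Both binds happen in one process here only to keep the sketch short; the
same collision occurs between a host process and a container process as
long as they share the network namespace.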

The missing virtualization of the abstract namespace means processes can
talk to services outside of the namespace. This has obvious
problems. And a couple of non-obvious ones on top: SCM_CREDENTIALS will
be weird due to the non-matching users and stuff.

When we enter the container we drop all capabilities, except the following:

CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_DAC_READ_SEARCH, CAP_FOWNER,
CAP_FSETID, CAP_IPC_OWNER, CAP_KILL, CAP_LEASE, CAP_LINUX_IMMUTABLE,
CAP_NET_BIND_SERVICE, CAP_NET_BROADCAST, CAP_NET_RAW, CAP_SETGID,
CAP_SETFCAP, CAP_SETPCAP, CAP_SETUID, CAP_SYS_ADMIN, CAP_SYS_CHROOT,
CAP_SYS_NICE, CAP_SYS_PTRACE, CAP_SYS_TTY_CONFIG.
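
For reference, the retained set can be expressed as a bit mask and
compared against the CapBnd line in /proc/self/status from inside a
container. The following Python sketch is only an illustration: the helper
names are hypothetical, the bit numbers are taken from
<linux/capability.h>, and the table covers only bits 0 to 31.

```python
# Capability names indexed by bit number, per <linux/capability.h> (bits 0-31).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]

# The capabilities listed above as retained in the container.
RETAINED = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_IPC_OWNER", "CAP_KILL", "CAP_LEASE",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_RAW", "CAP_SETGID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_SETUID",
    "CAP_SYS_ADMIN", "CAP_SYS_CHROOT", "CAP_SYS_NICE", "CAP_SYS_PTRACE",
    "CAP_SYS_TTY_CONFIG",
]


def to_mask(names):
    """Combine capability names into a single integer bit mask."""
    mask = 0
    for name in names:
        mask |= 1 << CAP_NAMES.index(name)
    return mask


def from_mask(mask):
    """Return the sorted capability names whose bits are set in mask."""
    return sorted(n for bit, n in enumerate(CAP_NAMES) if (mask >> bit) & 1)


def read_bounding_set(path="/proc/self/status"):
    """Read the CapBnd hex mask of the current process (Linux only)."""
    with open(path) as status:
        for line in status:
            if line.startswith("CapBnd:"):
                return int(line.split()[1], 16)
    raise RuntimeError("no CapBnd line found in " + path)


if __name__ == "__main__":
    extra = read_bounding_set() & ~to_mask(RETAINED)
    print("capabilities beyond the retained set:", from_mask(extra))
```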

Due to the PID, fs and IPC namespacing a couple of these capabilities
should not be much of a problem. Except for a few cases:

- We don't virtualize the network for simplicity reasons, that means
  CAP_NET_BIND_SERVICE allows processes in the container to bind to any
  port, thus preventing services outside of the container from using
  it. Now, it would be easy to remove this capability too, but that of
  course would still allow DoS of high port services on the host from
  within the container. (Consider the container blocking all ports >
  6000 thus making it impossible to run X on the host). But this one is
  actually not a big issue in the end I guess, so let's ignore it here.

- CAP_NET_RAW means that the container can sniff into the host's traffic.

- CAP_SYS_ADMIN is a grab bag of things, and is the biggie here. With this
  the container can remount /sys, /selinux and /proc/sys read-writable
  and thus influence the host massively. It can disable swap
  partitions, too, and lots and lots of other things.

- A couple of the FS related operations might be problematic since the
  abstract namespace sockets are not virtualized, and thus you could do
  privileged operations on fds from outside the container.
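
As a rough aid for the CAP_SYS_ADMIN point, the mount table shows which
of the paths named above are currently writable. The Python sketch below
parses text in the /proc/self/mounts format; the function names and the
sample table are made up for illustration. Keep in mind that a read-only
entry is no protection by itself as long as the container keeps
CAP_SYS_ADMIN, since it can simply remount.

```python
# Paths named in the CAP_SYS_ADMIN item above.
SENSITIVE = ["/sys", "/selinux", "/proc/sys"]

# Illustrative sample in /proc/self/mounts format, not taken from a real system.
SAMPLE = """\
sysfs /sys sysfs ro,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,nosuid,nodev,noexec,relatime 0 0
"""


def mount_options(mounts_text):
    """Map each mount point to its option list; later entries win."""
    options = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        options[fields[1]] = fields[3].split(",")
    return options


def writable_paths(mounts_text, paths):
    """Return those paths that are mount points mounted read-write."""
    options = mount_options(mounts_text)
    return [p for p in paths if "rw" in options.get(p, [])]


if __name__ == "__main__":
    with open("/proc/self/mounts") as mounts:
        print(writable_paths(mounts.read(), SENSITIVE))
```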

There's also currently no virtualization of the users. That means
RLIMIT_NPROC and stuff when applied in the container will also affect
the same user outside of the container. That's pretty bad...
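
To make this concrete: the kernel counts processes against RLIMIT_NPROC
per real UID, and without user namespaces a UID inside the container is
the very same UID on the host. A small Python sketch (helper names are
hypothetical) that prints the values involved; running it as the same
user inside and outside the container would show the same UID in both
places:

```python
import os
import resource


def nproc_view():
    """Return (uid, soft, hard) for RLIMIT_NPROC of the calling process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return os.getuid(), soft, hard


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to a readable word."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    uid, soft, hard = nproc_view()
    print("uid=%d soft=%s hard=%s" % (uid, describe(soft), describe(hard)))
```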

Some of these issues require kernel support to fix properly (for example
the RLIMIT_NPROC issue). Others we could probably fix in userspace. For
example, we might be able to make CAP_SYS_ADMIN unnecessary if we
premount really everything in the container that it might need. systemd
is already smart enough to be happy with pre-mounted directories, not
entirely sure about sysvinit though. With a bit of work we probably
could even add CLONE_NEWNET support, and automatically set up a valid
virtualized net interface for the container, that could not be
reconfigured by the container and is always forwarded to the host, but
which buys us AF_UNIX abstract namespace virtualization and fixes the
CAP_NET_BIND_SERVICE issue.

With CLONE_NEWUSER in place and these changes we could probably make
things reasonably secure. But especially figuring out a way to
virtualize the network in an elegant way so that things will continue to
"just work" is not going to be easy.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.