[systemd-devel] systemd-nspawn containers

Fri Nov 11 17:28:59 UTC 2016

On Fri, 11.11.16 16:41, Michał Zegan (webczat_200 at poczta.onet.pl) wrote:

> Thank you for your answers!
> 
> What I meant by secure containers is mostly, containers that are or will
> be secure enough to use them for things like virtual private server
> hosting. Is nspawn intended to be usable for such things in the future,
> or maybe it already is, or whatever?

I already run my own server this way, as an exercise in dogfooding.

So, yes, running a VPS like this certainly works, but do note that
nspawn doesn't do orchestration or anything. It's good enough for me,
but if you need fancy orchestration tools then nspawn won't be
sufficient.

> What kernel limitations do you mean when you say about security?

Well, a lot of subsystems cannot be locked down properly for use in
containers yet. You can lock down a lot, in particular if you use
userns, but there are still a lot of holes in there, and userns alone
has been a major source of CVEs in the most recent kernels.

Right now, "containers" in general are not about security. Some
companies claim they are secure, but they really aren't. And that's
not a bug in nspawn, or docker, or lxc for that matter, it's simply a
limitation of the kernel.

Or to say this differently: we'll do in nspawn everything we can to
lock things down properly, but there are limits based on what the
kernel provides... As the kernel gets improved in this area, we'll
update nspawn to make use of it. We are in the same boat in this
regard as other container managers, and they have more or less the
same limits we have.

> For now I know that in full containers with userns file capabilities do
> not work (I think), you have no virtualized /proc/meminfo and friends
> (do cgroup namespaces give a chance to change that?), you cannot mknod
> devices (no whitelist possible at this level), no fuse support, no
> automatic uid shifting kernel level, no possibility to mount physical
> filesystems in userns, and no possibility to have selinux/etc per
> container. Do you mean such limitations or something else?

Well, devices are not virtualized at all (with the exception of
network devices), that means no udev, no hotplug events and so
on. Some container managers ignore this, and provide access to
selected device nodes anyway, but we don't do something like that in
nspawn, since it's pretty broken (as /sys wouldn't match what you see
in /dev). In general, I think people should just accept that
containers mean "you don't get physical device access". And if you
want physical device access, then don't use containers...

> I am interested in this topic but it is quite hard for me to track
> progress in that area (kernel side) even though I subscribe to some
> kernel MLs and know at least about submitted patches, or some of
> them. What else is missing that I didn't mention that would be
> good to have?

Well, a lot of stuff is still not properly virtualized. What comes to
mind: audit, autofs, keyring, cgroups, …

> Also what about setting cgroup parameters per container? nspawn does not
> allow doing that, and you probably do not intend it to be done by
> overriding container's scope unit settings, for example?

You can actually do that just fine. Simply set it in the nspawn service
file. Or, if you run nspawn from the cmdline, use the "--property="
switch. Or make your changes dynamically via "systemctl set-property".
It's all supported and works well.
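
As a rough sketch of those three routes: the machine name "web1", the
image path and the limit values below are made-up placeholders, and
which property names are honoured (TasksMax=, MemoryMax=, ...) depends
on your systemd version and cgroup setup. The script only prints the
commands unless told otherwise.

```shell
#!/bin/sh
# Sketch only: "web1", the image path and the limit values are hypothetical.
# Commands are printed by default; export RUN= (empty) to really execute them.
run() { ${RUN-echo} "$@"; }

# The three routes are alternatives, not steps to run in sequence.
apply_limits() {
    machine=$1

    # 1. Service file route: a drop-in for systemd-nspawn@web1.service,
    #    e.g. created with "systemctl edit", containing:
    #        [Service]
    #        TasksMax=512

    # 2. Command line route: set a property on the unit nspawn registers
    #    for the machine when started by hand.
    run systemd-nspawn --machine="$machine" \
        --directory="/var/lib/machines/$machine" --boot \
        --property=TasksMax=512

    # 3. Dynamic route: change the unit of an already running container.
    #    Assumes it was started via the systemd-nspawn@.service template;
    #    a container started by hand lives in a scope unit instead.
    run systemctl set-property "systemd-nspawn@$machine.service" MemoryMax=1G
}

apply_limits web1
```

"systemctl show" on the unit in question should tell you whether a
given property was actually picked up.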

Lennart

-- 
Lennart Poettering, Red Hat