[systemd-devel] Unable to run systemd in an LXC / cgroup container.

Michael H. Warfield mhw at WittsEnd.com
Mon Oct 22 08:48:41 PDT 2012


On Mon, 2012-10-22 at 16:11 +0200, Lennart Poettering wrote:
> On Sun, 21.10.12 17:25, Michael H. Warfield (mhw at WittsEnd.com) wrote:
> 
> > Hello,
> > 
> > This is being directed to the systemd-devel community but I'm cc'ing the
> > lxc-users community and the Fedora community on this for their input as
> > well.  I know it's not always good to cross post between multiple lists
> > but this is of interest to all three communities who may have valuable
> > input.
> > 
> > I'm new to this particular list, just having joined after tracking a
> > problem down to some systemd internals...
> > 
> > Several people over the last year or two on the lxc-users list have been
> > discussing trying to run certain distros (notably Fedora 16 and above,
> > recent Arch Linux and possibly others) in LXC containers, virtualizing
> > entire servers this way.  This is very similar to Virtuozzo / OpenVZ only
> > it's using the native Linux cgroups for the containers (primary reason I
> > dumped OpenVZ was to avoid their custom patched kernels).  These recent
> > distros have switched to systemd for the main init process and this has
> > proven to be disastrous for those of us using LXC and trying to install
> > or update our containers.

> Note that it is explicitly our intention to make running systemd inside
> of containers as smooth as possible. The notes Kay linked summarize what
> the container manager needs to do for best integration.

> > To summarize the problem...  The LXC startup binary sets up various
> > things for /dev and /dev/pts for the container to run properly and this
> > works perfectly fine for SystemV start-up scripts and/or Upstart.
> > Unfortunately, systemd has mounts of devtmpfs on /dev and devpts
> > on /dev/pts which then break things horribly.  This is because the
> > kernel currently lacks namespaces for devices and won't for some time to
> > come (in design).  When devtmpfs gets mounted over top of /dev in the
> > container, it then hijacks the host's console tty and several other
> > devices which had been set up through bind mounts by LXC and should have
> > been LEFT ALONE.

> Please initialize a minimal tmpfs on /dev. systemd will then work fine.

My containers have a reasonable /dev that works with Upstart just fine
but they are not on tmpfs.  Is mounting tmpfs on /dev and recreating
that minimal /dev required?
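
A rough sketch of what "a minimal tmpfs on /dev" could look like from the
container manager's side follows.  It is an illustration only, not what
LXC or systemd actually do: the helper names (tmpfs_mount_command,
populate_dev), the mount options and the choice of device nodes are all
assumptions.  The major/minor numbers are the conventional Linux ones.

```python
import os
import stat

# Conventional Linux character-device numbers (major, minor).
# Which nodes count as "minimal" is an assumption, not taken from the thread.
MINIMAL_NODES = {
    "null": (1, 3),
    "zero": (1, 5),
    "full": (1, 7),
    "random": (1, 8),
    "urandom": (1, 9),
    "tty": (5, 0),
}


def tmpfs_mount_command(dev_dir):
    """Build (but do not run) a mount(8) command placing a tmpfs on dev_dir."""
    # The option string is an assumed, conservative choice.
    return ["mount", "-t", "tmpfs", "-o", "mode=0755,nosuid", "tmpfs", dev_dir]


def populate_dev(dev_dir):
    """Create the device nodes inside dev_dir (requires root / CAP_MKNOD)."""
    for name, (major, minor) in MINIMAL_NODES.items():
        path = os.path.join(dev_dir, name)
        os.mknod(path, 0o666 | stat.S_IFCHR, os.makedev(major, minor))
        # mknod applies the umask, so set the final mode explicitly.
        os.chmod(path, 0o666)
```

The command list would be handed to something like subprocess.run()
before the container's init starts, followed by populate_dev() on the
same directory.  /dev/console and /dev/pts are left out on purpose,
since LXC already sets those up itself.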

> > Yes!  I recognize that this problem with devtmpfs and lack of namespaces
> > is a potential security problem anyways that could (and does) cause
> > serious container-to-host problems.  We're just not going to get that
> > fixed right away in the Linux cgroups and namespaces.

> No, devtmpfs really doesn't need updating, containers simply shouldn't
> use it.

Ok, yeah.  That seems to be at the heart of the problem we're trying to
solve.

> > How do we work around this problem in systemd where it has hard coded
> > mounts in the binary that we can't override or configure?  Or is it
> > there and I'm just missing it trying to examine the sources?  That's how
> > I found where the problem lay.

> systemd will make use of pre-existing mounts if they exist, and only
> mount something new if they don't exist.

So you're saying that, if we have something mounted on /dev, that's what
prevents systemd from mounting devtmpfs on /dev?  That could be
problematic.  I tested a couple of options there and they didn't work.
That's going to take some effort.
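
To see what is actually mounted on /dev when testing this, one option
is to read /proc/self/mountinfo from inside the container.  The sketch
below only inspects the mount table; it does not reproduce whatever
check systemd performs internally, and the function name mounts_at is
made up for the example.

```python
def mounts_at(path, mountinfo_text):
    """Return the filesystem types mounted at exactly `path`, in mount order."""
    fstypes = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Layout: id parent major:minor root mount_point options
        #         [optional fields...] - fstype source super_options
        if len(fields) < 10 or "-" not in fields[6:]:
            continue  # skip anything that does not look like a mountinfo line
        sep = fields.index("-", 6)
        if fields[4] == path:
            fstypes.append(fields[sep + 1])
    return fstypes
```

Called as mounts_at("/dev", open("/proc/self/mountinfo").read()), an
empty list means nothing is mounted there, while a result ending in
"devtmpfs" means devtmpfs is the topmost mount on /dev.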

> Note that there are reports that LXC has issues with the fact that newer
> systemd enables shared mount propagation for all mounts by default (this
> should actually be beneficial for containers as this ensures that new
> mounts appear in the containers). LXC when run on such a system fails as
> soon as it tries to use pivot_root(), as that is incompatible with
> shared mount propagation. This needs fixing in LXC: it should use MS_MOVE
> or MS_BIND to place the new root dir in / instead. A short term
> work-around is to simply remount the root tree to private before
> invoking LXC.

But, I have systemd running on my host system (F17) and containers with
sysvinit or upstart inits are all starting just fine.  That sounds like
it should impact all containers as pivot_root() is issued before systemd
in the container is started.  Or am I missing something here?  That
sounds like a problem for Serge and others to investigate further.  I'll
see about trying that workaround though.
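
For reference, a sketch of how that workaround could be tried: check
whether the root mount is marked shared, and if so remount the tree
private before invoking LXC.  It assumes the util-linux mount command
and its --make-rprivate option; the function names are hypothetical,
and the command is only built here, not executed.

```python
def propagation_tags(path, mountinfo_text):
    """Return the optional-field tags (e.g. 'shared:1') of the last mount at path."""
    tags = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Optional fields sit between the mount options and the "-" separator.
        if len(fields) < 10 or "-" not in fields[6:]:
            continue
        sep = fields.index("-", 6)
        if fields[4] == path:
            tags = fields[6:sep]
    return tags


def is_shared(path, mountinfo_text):
    """True if the mount at path carries a 'shared:N' propagation tag."""
    tags = propagation_tags(path, mountinfo_text)
    return bool(tags) and any(tag.startswith("shared:") for tag in tags)


def make_private_command(path="/"):
    """Build (but do not run) the recursive remount-to-private command."""
    return ["mount", "--make-rprivate", path]
```

If is_shared("/", open("/proc/self/mountinfo").read()) returns True on
the host, running the command from make_private_command() as root
before starting the container is the short-term workaround described
above.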

> Lennart

> -- 
> Lennart Poettering - Red Hat, Inc.

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!