[systemd-devel] [Lxc-users] Unable to run systemd in an LXC / cgroup container.

Michael H. Warfield mhw at WittsEnd.com
Mon Oct 22 06:04:34 PDT 2012


On Mon, 2012-10-22 at 09:06 +0100, John wrote:
> On 22/10/12 03:06, Michael H. Warfield wrote:
> > On Mon, 2012-10-22 at 02:53 +0200, Kay Sievers wrote:
> >> On Sun, Oct 21, 2012 at 11:25 PM, Michael H. Warfield <mhw at wittsend.com> wrote:
> >>> This is being directed to the systemd-devel community but I'm cc'ing the
> >>> lxc-users community and the Fedora community on this for their input as
> >>> well.  I know it's not always good to cross post between multiple lists
> >>> but this is of interest to all three communities who may have valuable
> >>> input.
> >>>
> >>> I'm new to this particular list, just having joined after tracking a
> >>> problem down to some systemd internals...
> >>>
> >>> Several people over the last year or two on the lxc-users list have been
> >>> discussing trying to run certain distros (notably Fedora 16 and above,
> >>> recent Arch Linux and possibly others) in LXC containers, virtualizing
> >>> entire servers this way.  This is very similar to Virtuozzo / OpenVZ only
> >>> it's using the native Linux cgroups for the containers (primary reason I
> >>> dumped OpenVZ was to avoid their custom patched kernels).  These recent
> >>> distros have switched to systemd for the main init process and this has
> >>> proven to be disastrous for those of us using LXC and trying to install
> >>> or update our containers.
> >>>
> >>> To put it bluntly, it doesn't work and causes all sorts of problems on
> >>> the host.
> >>>
> >>> To summarize the problem...  The LXC startup binary sets up various
> >>> things for /dev and /dev/pts for the container to run properly and this
> >>> works perfectly fine for SystemV start-up scripts and/or Upstart.
> >>> Unfortunately, systemd mounts devtmpfs on /dev and devpts on
> >>> /dev/pts, which then breaks things horribly.  This is because the
> >>> kernel currently lacks namespaces for devices and won't have them for
> >>> some time to come (still in design).  When devtmpfs gets mounted over
> >>> top of /dev in the container, it then hijacks the host's console tty
> >>> and several other devices which had been set up through bind mounts
> >>> by LXC and should have been LEFT ALONE.
> >>>
> >>> Yes!  I recognize that this problem with devtmpfs and lack of namespaces
> >>> is a potential security problem anyways that could (and does) cause
> >>> serious container-to-host problems.  We're just not going to get that
> >>> fixed right away in the Linux cgroups and namespaces.
> >>>
> >>> How do we work around this problem in systemd where it has hard coded
> >>> mounts in the binary that we can't override or configure?  Or is it
> >>> there and I'm just missing it trying to examine the sources?  That's how
> >>> I found where the problem lay.
> >> As a first step, this probably explains most of it:
> >>    http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface
> > A very long ways, yeah.  That looks like it could be just what we've
> > been looking for.  Just gotta figure out how to set that environment
> > variable but that's up to a couple of others to comment on in the
> > lxc-users list.  Then we'll see where we go from there.
> >
> > Many thanks!
> >
> >> Kay
> > Regards,
> > Mike
> >
> 
> I've just performed a very quick check on my Arch Linux system here.
> 
> on host (running systemd):
> # cat /proc/1/environ
> TERM=linuxRD_TIMESTAMP=
> 
> In a container (running sysvinit):
> # cat /proc/1/environ
> STY=623.systemd-lithiumTERM=screenTERMCAP=SC|screen|VT 100/ANSI X3.64 
> virtual terminal:\
>      :DO=\E[%dB:LE=\E[%dD:RI=\E[%dC:UP=\E[%dA:bs:bt=\E[Z:\
>      :cd=\E[J:ce=\E[K:cl=\E[H\E[J:cm=\E[%i%d;%dH:ct=\E[3g:\
>      :do=^J:nd=\E[C:pt:rc=\E8:rs=\Ec:sc=\E7:st=\EH:up=\EM:\
>      :le=^H:bl=^G:cr=^M:it#8:ho=\E[H:nw=\EE:ta=^I:is=\E)0:\
>      :li#24:co#80:am:xn:xv:LP:sr=\EM:al=\E[L:AL=\E[%dL:\
>      :cs=\E[%i%d;%dr:dl=\E[M:DL=\E[%dM:dc=\E[P:DC=\E[%dP:\
>      :im=\E[4h:ei=\E[4l:mi:IC=\E[%d@:ks=\E[?1h\E=:\
>      :ke=\E[?1l\E>:vi=\E[?25l:ve=\E[34h\E[?25h:vs=\E[34l:\
>      :ti=\E[?1049h:te=\E[?1049l:k0=\E[10~:k1=\EOP:k2=\EOQ:\
>      :k3=\EOR:k4=\EOS:k5=\E[15~:k6=\E[17~:k7=\E[18~:\
>      :k8=\E[19~:k9=\E[20~:k;=\E[21~:F1=\E[23~:F2=\E[24~:\
>      :kh=\E[1~:@1=\E[1~:kH=\E[4~:@7=\E[4~:kN=\E[6~:kP=\E[5~:\
> :kI=\E[2~:kD=\E[3~:ku=\EOA:kd=\EOB:kr=\EOC:kl=\EOD:WINDOW=0SHELL=/bin/shPATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/binLANG=en_GB.UTF-8container=lxc

> So it looks like that "container" environment variable is already set on 
> PID1
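
The values in the dumps above run together because /proc/<pid>/environ
separates its entries with NUL bytes, which a terminal does not display.
A minimal Python sketch for splitting such a dump and pulling out the
"container" variable follows.  The helper name parse_environ and the
sample data are made up for illustration, and reading /proc/1/environ
on a live system normally requires root.

```python
def parse_environ(raw: bytes) -> dict:
    """Split a NUL-separated environment block into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after a trailing NUL
        key, sep, value = entry.partition(b"=")
        if sep:  # ignore anything that is not KEY=VALUE
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


if __name__ == "__main__":
    # Illustrative sample shaped like the container output, NULs restored.
    sample = b"TERM=screen\0SHELL=/bin/sh\0LANG=en_GB.UTF-8\0container=lxc\0"
    print(parse_environ(sample).get("container"))

    # On a live system (as root) the real file could be read instead:
    #   with open("/proc/1/environ", "rb") as f:
    #       print(parse_environ(f.read()).get("container"))
```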

Yeah, I saw that myself last night.  I tested it and systemd still does
not work here (although it no longer seems to grab the host console),
whereas upstart fires right up and I see that container variable set.
That wiki page lists a number of mounts, so maybe one of them is
missing.  Right now the container just hangs on startup and, when I
then try to shut it down, I end up with a hung resource: the cgroups
directory can't be deleted because it's busy.  The only thing I changed
was the /sbin/init link, from upstart to systemd.  Now the container is
dead and I'll have to reboot the host to free the resource.  :-P
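
One way to check whether devtmpfs ended up mounted over the container's
/dev is to look at the mount table of a process inside the container
(/proc/<pid>/mounts).  A rough Python sketch, assuming the usual
whitespace-separated layout of that file (device, mount point, type,
options, ...); the function name fstype_at and the sample table are
made up for illustration and are not output from a real system.

```python
def fstype_at(mounts_text: str, mount_point: str):
    """Return the filesystem type of the last mount on mount_point, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            found = fields[2]  # later entries shadow earlier ones
    return found


# Hypothetical mount table, for illustration only.
SAMPLE = """\
rootfs / rootfs rw 0 0
devtmpfs /dev devtmpfs rw,relatime 0 0
devpts /dev/pts devpts rw,relatime 0 0
"""

if __name__ == "__main__":
    for point in ("/dev", "/dev/pts"):
        print(point, fstype_at(SAMPLE, point))

    # Against a live process, the real table could be read instead:
    #   with open("/proc/1/mounts") as f:
    #       print(fstype_at(f.read(), "/dev"))
```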

> Regards,
> John

Regards,
Mike
-- 
Michael H. Warfield (AI4NB) | (770) 985-6132 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!