[systemd-devel] [PATCH 1/2] Add detect_userns to detect uid/gid shifts

Fri Jan 9 07:02:45 PST 2015

On Thu, 08.01.15 18:55, Stéphane Graber (stgraber at ubuntu.com) wrote:

> On Fri, Jan 09, 2015 at 12:39:23AM +0100, Lennart Poettering wrote:
> > On Thu, 08.01.15 15:33, Stéphane Graber (stgraber at ubuntu.com) wrote:
> > 
> > > As far as I know there's no obvious way to detect this case (well,
> > > short of trying a bunch of restricted syscalls). The only way I'm
> > > aware of is by comparing the target of /proc/self/ns/user to that of
> > > /proc/<real host pid 1>/ns/user which is doable at the host level
> > > but isn't once you are in a container with your own pid namespace
> > > (which since we're talking about pid 1 systemd there can probably be
> > > assumed).
> > 
> > Hmm, if this is so unreliable to detect maybe we shouldn't after all.
> > 
> > Given that systemd git is no longer fatally failing if it cannot write
> > to oom adjust I think all is good now?
> 
> Yeah, I think we're good for now. I've got systemd running fine in an
> unprivileged container here, booting without problems to a shell and
> with all the basic services running as expected (and those I was
> expecting to fail, failed but didn't block the boot in any way).
> 
> I expect we'll run into some more problems when dealing with units that
> start with their own view of /dev since mknod in a userns isn't allowed
> but I haven't run into one of those yet so it's not very high on my list.
> 
> Once that happens, I expect we can solve it either by again just
> ignoring the failure or by catching the failure and falling back to
> doing a bind-mount of the device in question from the parent /dev (which
> works fine in a userns and is what we do today for nested containers
> with LXC).
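
[A rough sketch of the mknod-then-bind-mount fallback described in the
quoted text above. This is not code from systemd or LXC. The names
provide_device() and bind_mount() are hypothetical, and it assumes a
mount(8) binary that accepts --bind. The mknod and bind callables are
parameters only so the fallback logic can be exercised without
privileges.]

```python
import os
import subprocess


def bind_mount(source, target):
    """Bind-mount an existing device node onto target (hypothetical helper)."""
    # A bind mount needs an existing mount point, so create an empty file first.
    open(target, "a").close()
    subprocess.run(["mount", "--bind", source, target], check=True)


def provide_device(target, mode, device, host_path,
                   mknod=os.mknod, bind=bind_mount):
    """Create a device node at target, or bind-mount host_path if mknod is denied.

    Returns "mknod" or "bind" to report which path was taken.
    """
    try:
        mknod(target, mode, device)
        return "mknod"
    except PermissionError:
        # Creating device nodes is refused inside a user namespace;
        # fall back to reusing the node from the parent /dev.
        bind(host_path, target)
        return "bind"
```

[Something like provide_device("/run/mydev/null", stat.S_IFCHR | 0o666,
os.makedev(1, 3), "/dev/null") would then try mknod first and only
bind-mount /dev/null when that is denied. The target path here is made
up for illustration.]
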

Note that most of systemd's own daemons use PrivateDevices=,
PrivateTmp= and suchlike by default, hence you couldn't really start
much if this didn't work...

A while back we added some changes to make permission problems with fs
namespaces graceful. This was done to support CAP_SYS_ADMIN-less
containers, which cannot even mount:

http://cgit.freedesktop.org/systemd/systemd/tree/src/core/execute.c#n1584

I figure userns containers are only slightly less limited than
CAP_SYS_ADMIN-less containers are, hence I think for most purposes you
should already be fine...
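
[To illustrate what "graceful" means here, a minimal sketch of the
pattern, not the actual code in execute.c: a permission error during
namespace setup is logged and skipped, while any other error still
propagates. apply_sandboxing() and its setup argument are hypothetical
names.]

```python
import errno


def apply_sandboxing(setup, log=print):
    """Run setup(); treat EPERM/EACCES as non-fatal.

    Returns True if the sandboxing was applied, False if it was skipped
    because the caller lacks the privileges to set it up.
    """
    try:
        setup()
    except OSError as e:
        if e.errno in (errno.EPERM, errno.EACCES):
            # Probably a missing capability (e.g. no CAP_SYS_ADMIN):
            # carry on without the private namespace.
            log("namespace setup not permitted, continuing without it: %s" % e)
            return False
        # Anything else is a real failure and stays fatal.
        raise
    return True
```
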

Lennart

-- 
Lennart Poettering, Red Hat