[systemd-devel] [PATCH 1/2] Add detect_userns to detect uid/gid shifts

Stéphane Graber stgraber at ubuntu.com
Thu Jan 8 16:35:26 PST 2015

On Fri, Jan 09, 2015 at 01:16:15AM +0100, Tom Gundersen wrote:
> On Fri, Jan 9, 2015 at 12:55 AM, Stéphane Graber <stgraber at ubuntu.com> wrote:
> > I expect we'll run into some more problems when dealing with units that
> > start with their own view of /dev since mknod in a userns isn't allowed
> > but I haven't run into one of those yet so it's not very high on my list.
> >
> > Once that happens, I expect we can solve it either by again just
> > ignoring the failure or by catching the failure and falling back to
> > doing a bind-mount of the device in question from the parent /dev (which
> > works fine in a userns and is what we do today for nested containers
> > with LXC).
>
> Ignoring the failure as in starting services with an empty /dev sounds
> like it won't work. Also, just using the parent dev despite explicitly
> being asked not to sounds dangerous (most of the time there won't be
> much interesting stuff in /dev in a container, but that is not
> guaranteed).
>
> Bindmounting should obviously work, but might it not make even more
> sense to fix mknod in the kernel (as there are likely to be more
> places than just systemd that need fixing for this)? Even if it is
> just a minimal fix along the lines of "allow mknod whenever mount
> --bind would do the trick"? Based on the commit message here it sounds
> like people would not be opposed to the idea:
> <http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=975d6b3932d43b87a48d2107264ed0c9a7541d8d>.
>
> Cheers,
> Tom

Well, the problem is that you'd have to allow the mknod but then never
allow chmod or chown on the resulting file.

The reason for that is that you may have, say, /dev/sda in the parent
container which is owned by -1:-1 (unmapped uid, uid 0 on the host) and
has mode 600.
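
To make the "unmapped uid" part concrete, here is a minimal sketch of how
a host uid translates through a map in the format of /proc/<pid>/uid_map
(inside-start, outside-start, length). The helper names and the
100000/65536 range are made up for illustration; they are not from the
patch or from LXC.

```python
from typing import List, Optional, Tuple

# Each entry is (inside_start, outside_start, length), as in uid_map.
IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse the contents of a uid_map or gid_map file."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def host_to_container_id(host_id: int, id_map: IdMap) -> Optional[int]:
    """Return the id as seen inside the namespace, or None if unmapped."""
    for inside, outside, length in id_map:
        if outside <= host_id < outside + length:
            return inside + (host_id - outside)
    return None  # unmapped: what the text above calls -1


# Hypothetical map: container ids 0..65535 are host ids 100000..165535.
example_map = parse_id_map("         0     100000      65536\n")
```

With that map, host uid 0 (the owner of /dev/sda on the host) has no
counterpart inside the container, while host uid 100000 is the
container's uid 0.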

This entry can be bind-mounted, and it'll keep its mode and still not be
accessible from inside the container.

However, if you took the ability to bind-mount it to mean that mknod is
safe, then you could mknod it, chmod and chown it, and then do whatever
you want to sda :)

So basically bind-mounts are good for that because the mount target
cannot then be chowned or chmodded, even by uid 0 in the userns, to grant
the user more rights than they had outside the container.
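
For reference, the "catch the failure and fall back to a bind-mount" idea
from earlier in the thread could look roughly like the sketch below. This
is only an illustration, not what systemd or LXC actually do: the
function names are hypothetical, and the mknod and bind-mount operations
are passed in as parameters so the logic can be exercised without
privileges.

```python
import ctypes
import errno
import os

MS_BIND = 4096  # value of MS_BIND in <sys/mount.h> on Linux


def bind_mount(source: str, target: str) -> None:
    """Bind-mount source onto target by calling mount(2) through libc."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_char_p, ctypes.c_ulong, ctypes.c_void_p)
    if libc.mount(source.encode(), target.encode(), None, MS_BIND, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)


def provide_device(source, target, mknod=os.mknod, bind=bind_mount):
    """Create target as a copy of the device node at source.

    Try mknod first; if the kernel refuses with EPERM (as it does in a
    userns), bind-mount the parent's node instead. Returns the name of
    the method that was used.
    """
    st = os.stat(source)
    try:
        mknod(target, st.st_mode, st.st_rdev)
        return "mknod"
    except OSError as exc:
        if exc.errno != errno.EPERM:
            raise  # only the "not permitted" case triggers the fallback
    # A bind mount needs an existing file to mount onto.
    fd = os.open(target, os.O_CREAT | os.O_RDONLY, 0o600)
    os.close(fd)
    bind(source, target)
    return "bind-mount"
```

Something like provide_device("/dev/null", "/some/private/dev/null")
would then keep the ownership and mode of the parent's node, which is
the property that makes the bind-mount safe.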

Having run about 300 production unprivileged containers with various
services (web servers, mail servers, package builders, CI
infrastructure, ...) for over a year now, I have yet to run into a common
piece of software which requires mknod and doesn't already have a
fallback mechanism.

We've also been discussing at Plumbers and other conferences ways to
intercept the mknod and mount syscalls at the container manager layer so
that we can have a privileged userspace service handle policies as to
what's fine and what's not and then do the actual action on the
container's behalf.

Something like what seccomp provides, but rather than just setting an
in-kernel policy for a given combination of syscall and arguments, have
it defer to a userspace service instead.

However, at this point all of this is just talk, and there's no kernel
code offering that kind of interface that I'm aware of.

Stéphane Graber
Ubuntu developer