[systemd-devel] "dynamic" uid allocation (was: [PATCH] loopback setup in unprivileged containers)

Wed Feb 4 01:01:35 PST 2015

On Tue, Feb 03, 2015 at 06:05:00PM +0100, Lennart Poettering wrote:
> On Tue, 03.02.15 16:34, Serge Hallyn (serge.hallyn at ubuntu.com) wrote:
> 
> > > > the UID/GID on entire filesystem sub-trees given to containers with
> > > > userns is a real unpleasant thing to have to deal with. I'd not want
> > 
> > Of course you would *not* want to take a stock rootfs where uid == 0
> > and shift that into the container, as that would give root in the
> > container a chance to write root-owned files on the host to leverage
> > later in a convoluted attack :)  
> 
> Is this really a problem? I mean, the only way this could be
> exploitable is if people make the container hierarchy accessible to
> other users, but that should be easy to prohibit by making the
> container's parent dir 0700, which we already do for nspawn's
> container in /var/lib/machines... The only other risk I can see here
> is that if people use traditional ext4 quota, then the container's
> disk usage will be added to the host's usage. But that's easy to
> avoid, by simply never placing container images and the host on the
> same quota device...
> 
> Also, in the case of systemd-nspawn we strongly emphasize usage with
> loopback devices. In that case there's no vulnerability at all, since
> the device is completely separate from the host fs, and it will only
> be mounted in the container, but not in the host...

NB, the container filesystem is visible via /proc/$PID/root,
but I agree with you in general. I don't see a reason to avoid
the scenario Serge mentioned. Indeed I think it is important that
we explicitly support it, because ultimately I think we need to
be able to take any arbitrary disk image and safely boot it in
either a container or virtual machine. i.e. we should not have to
build custom images just for containers - any such need should be
considered a failure of the technology / impl IMHO.

> > We might want to come up with a containers consensus that container
> > rootfs's are always shipped with uid range 0-65535 -> 100000-165535.
> > That still leaves a chance for container A (mapped to 200000-265535)
> > to write valid setuid-root binary for container B (mapped to
> > 300000-365535), which isn't possible otherwise.  But that's better
> > than doing so for host-root.
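
To make the range arithmetic in Serge's proposal concrete, here is a rough
sketch. It assumes a fixed 65536-ID range per container; the helper names
(uid_map_line, host_uid, ranges_overlap) are made up for illustration and
are not taken from any container manager. The only external fact relied on
is the /proc/PID/uid_map line format of "ID inside the namespace, ID
outside the namespace, count".

```python
RANGE = 65536  # assumed size of every container's UID range


def uid_map_line(host_base: int, count: int = RANGE) -> str:
    """Format one /proc/PID/uid_map entry mapping container UID 0 upwards
    onto host UIDs starting at host_base."""
    return f"0 {host_base} {count}"


def host_uid(container_uid: int, host_base: int) -> int:
    """Translate a UID as seen inside the container to the host UID."""
    if not 0 <= container_uid < RANGE:
        raise ValueError("UID outside the container's mapped range")
    return host_base + container_uid


def ranges_overlap(base_a: int, base_b: int, count: int = RANGE) -> bool:
    """True if two host UID ranges of equal size share any UID."""
    return base_a < base_b + count and base_b < base_a + count


if __name__ == "__main__":
    shipped, cont_a, cont_b = 100000, 200000, 300000
    print(uid_map_line(shipped))             # mapping for the shipped image
    print(host_uid(0, cont_a))               # container A's root on the host
    print(ranges_overlap(cont_a, cont_b))    # A and B are disjoint
```

With these bases, the shipped range and the ranges of containers A and B
never touch the host's own low UIDs, which is the property the proposal
is after.
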
> 
> Well, ultimately I'd recommend an automatism like this for container
> managers: 
> 
>    a) if not otherwise configured, let's give each container their own
>       16bit of uids. This would mean each 32bit uid could be neatly
>       split into the upper 16bit that would become a "container" id,
>       plus the lower 16bit for the actual "virtual" UID.
> 
>    b) we will never set up UID ranges orthogonal from GID ranges.
> 
>    c) when a container image is started, the container manager first
>       checks the UID/GID owner of the root of the root file system. It
>       masks the lower 16bit away, and only looks for the upper 16bit.
> 
>    d) It will then look for an unused container id (which means, an
>       unused range of 64K UIDs), and then shifts the offset it
>       identified following c) to this new container id.
> 
> With that in place it doesn't really matter which base people use in
> their containers, the container manager would do the right thing, and
> shift everything into the right place. Paranoid people could ship
> their container images shifted to some ID of their choice, and lazy
> folks could just ship their container images with base 0, but then
> must make sure they don't give anybody else access to the hierarchy,
> and don't confuse quota...
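
For what it's worth, steps a) to d) above can be sketched roughly as
follows. This is only an illustration under assumptions: the function
names are hypothetical, the set of container ids already in use is simply
passed in, and container id 0 is assumed to be reserved for the host.
Per b), the same functions would be applied unchanged to GIDs.

```python
UID_BITS = 16
RANGE = 1 << UID_BITS      # 65536 virtual UIDs per container, as in a)
LOW_MASK = RANGE - 1


def container_id(uid: int) -> int:
    """Upper 16 bits of a 32-bit UID: the "container" id."""
    return uid >> UID_BITS


def virtual_uid(uid: int) -> int:
    """Lower 16 bits of a 32-bit UID: the UID seen inside the container."""
    return uid & LOW_MASK


def base_of(uid: int) -> int:
    """Step c): mask away the lower 16 bits to find the image's base."""
    return uid & ~LOW_MASK


def pick_unused_container_id(used: set) -> int:
    """Step d): find a free 64K range. Id 0 is assumed to be the host's.
    Id 0xFFFF is skipped because its top UID would be (uid_t)-1."""
    for cid in range(1, 0xFFFF):
        if cid not in used:
            return cid
    raise RuntimeError("no free 64K UID range left")


def plan_shift(root_owner: int, used: set) -> tuple:
    """Return (old_base, new_base) for an image whose root directory is
    owned by root_owner."""
    old_base = base_of(root_owner)
    new_base = pick_unused_container_id(used) << UID_BITS
    return old_base, new_base


def shift_id(uid: int, old_base: int, new_base: int) -> int:
    """Move one file owner from the old range into the new one."""
    if base_of(uid) != old_base:
        raise ValueError("file owner lies outside the image's UID range")
    return new_base | virtual_uid(uid)


if __name__ == "__main__":
    # A "lazy" image shipped with base 0, while ids 1 and 2 are taken.
    old, new = plan_shift(root_owner=0, used={1, 2})
    print(old, new, shift_id(1000, old, new))
```

An image shipped at some other base goes through exactly the same path,
since only the upper 16 bits of the root directory's owner are looked at.
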

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|