[systemd-devel] "dynamic" uid allocation (was: [PATCH] loopback setup in unprivileged containers)

Tue Feb 3 09:05:00 PST 2015

On Tue, 03.02.15 16:34, Serge Hallyn (serge.hallyn at ubuntu.com) wrote:

> > > the UID/GID on entire filesystem sub-trees given to containers with
> > > userns is a real unpleasant thing to have to deal with. I'd not want
> 
> Of course you would *not* want to take a stock rootfs where uid == 0
> and shift that into the container, as that would give root in the
> container a chance to write root-owned files on the host to leverage
> later in a convoluted attack :)  

Is this really a problem? I mean, the only way this could be
exploited is if people make the container hierarchy accessible to
other users, but that should be easy to prohibit by making the
container's parent dir 0700, which we already do for nspawn's
containers in /var/lib/machines... The only other risk I can see here
is that if people use traditional ext4 quota, then the container's
disk usage will be charged to the host users with the same UIDs. But
that's easy to avoid, by simply never placing container images and
the host on the same quota device...

Also, in the case of systemd-nspawn we strongly emphasize usage with
loopback devices. In that case there's no vulnerability at all, since
the device is completely separate from the host fs, and it will only
be mounted in the container, not on the host...

> We might want to come up with a containers consensus that container
> rootfs's are always shipped with uid range 0-65535 -> 100000-165535.
> That still leaves a chance for container A (mapped to 200000-265535)
> to write valid setuid-root binary for container B (mapped to
> 300000-365535), which isn't possible otherwise.  But that's better
> than doing so for host-root.

Well, ultimately I'd recommend an automatic scheme like this for
container managers:

   a) if not otherwise configured, let's give each container its own
      16 bits of UIDs, i.e. a range of 64K UIDs. This would mean each
      32-bit UID could be neatly split into the upper 16 bits, which
      would become a "container" id, plus the lower 16 bits for the
      actual "virtual" UID.

   b) we will never set up UID ranges independently of GID ranges: a
      container always gets the same range for both.

   c) when a container image is started, the container manager first
      checks the UID/GID that owns the root directory of the root file
      system. It masks the lower 16 bits away, and only looks at the
      upper 16 bits.

   d) It will then look for an unused container id (which means an
      unused range of 64K UIDs), and then shift the image from the
      base it identified in c) to this new container id.
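
To make the arithmetic in a) to d) concrete, here is a minimal sketch in
Python. It is only an illustration of the idea, not code from nspawn or
any other container manager; all function names are made up, and the
choice to skip container id 0 (the host's own range) and 0xFFFF (whose
range would contain the invalid UID 4294967295) is an assumption of
this sketch.

```python
CONTAINER_BITS = 16
UID_MASK = (1 << CONTAINER_BITS) - 1  # 0xFFFF, the "virtual" UID part


def split_uid(uid):
    """a) Split a 32-bit UID into (container id, virtual UID)."""
    return uid >> CONTAINER_BITS, uid & UID_MASK


def image_base(root_owner):
    """c) Mask away the lower 16 bits of the root directory's owner."""
    return root_owner & ~UID_MASK


def pick_free_container_id(used_ids, first=1, last=0xFFFE):
    """d) Find an unused container id, i.e. an unused 64K range.

    Assumption: id 0 is left to the host and 0xFFFF is never handed out.
    """
    for cid in range(first, last + 1):
        if cid not in used_ids:
            return cid
    raise RuntimeError("no free container id left")


def shift_id(uid, old_base, new_cid):
    """d) Move a UID (or, per b), a GID) from old_base to new_cid's range."""
    return (new_cid << CONTAINER_BITS) + (uid - old_base)
```

For example, an image whose root directory is owned by UID 196608
(container id 3) would be detected as having base 196608, and a file
owned by 196608 + 1000 would end up at 5 * 65536 + 1000 if container
id 5 is the first free one. Per b), the same shift applies to GIDs.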

With that in place it doesn't really matter which base people use in
their containers: the container manager would do the right thing, and
shift everything into the right place. Paranoid people could ship
their container images shifted to some ID of their choice, and lazy
folks could just ship their container images with base 0, but then
they must make sure they don't give anybody else access to the
hierarchy, and don't confuse quota...
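
As a rough illustration of what "shift everything into the right place"
could mean for an on-disk tree, here is a hypothetical Python helper.
It is a sketch under assumptions, not how any existing tool does it:
the name shift_tree and the injectable chown parameter are invented
here, and it simply keeps the lower 16 bits of each owner and replaces
the upper 16 bits with the new container id.

```python
import os

UID_MASK = 0xFFFF  # lower 16 bits: the "virtual" UID/GID inside the container


def shift_tree(root, new_cid, chown=os.lchown):
    """Re-own root and everything below it into container id new_cid.

    Hypothetical helper. `chown` is injectable so the logic can be
    tried without privileges; the default changes ownership without
    following symlinks.
    """
    new_base = new_cid << 16
    paths = [root]
    # os.walk does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)
        chown(path,
              new_base + (st.st_uid & UID_MASK),
              new_base + (st.st_gid & UID_MASK))
```

Something like shift_tree("/var/lib/machines/foo", 5) would need to run
as root. A real implementation would likely also have to deal with
things this sketch ignores, such as setuid/setgid bits that may be
cleared by chown on Linux, ACLs, and file capabilities.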

> > > [1] Using a separate disk image per container means a container can't
> > >     DOS other containers by exhausting inodes for example with $millions
> > >     of small files.
> > 
> > Indeed. I'd claim that without such a concept of mount point uid
> > shifting the whole userns story is not very useful IRL...
> 
> I had always thought this would eventually be done using a stackable
> filesystem, but doing it at bind mount time would be neat too, and
> less objectionable to the kernel folks.  (Though overlayfs is in now,
> so <shrug>)
> 
> I'm actually quite surprised no one has sat down and written a
> stackable uid-shifting fs yet.

Whether it's done as part of bind mounts, as an extension of
overlayfs, or in a completely new fs doesn't really matter to me. I'd
certainly welcome a solution based on any of these options!

Lennart

-- 
Lennart Poettering, Red Hat