[systemd-devel] "dynamic" uid allocation (was: [PATCH] loopback setup in unprivileged containers)

Tue Feb 3 08:34:30 PST 2015

Quoting Lennart Poettering (lennart at poettering.net):
> On Tue, 03.02.15 15:03, Daniel P. Berrange (berrange at redhat.com) wrote:
> 
> > > Hmm, so, I thought a lot about this in the past weeks. I think the way
> > > I'd really like to see this work in the end is that we never have to
> > > persist the UID mappings. This could work if the kernel would provide
> > > us with the ability to bind mount a file system into the container
> > > applying a fixed UID shift. That way, the shifted UIDs would never hit
> > > the actual disk, and hence we wouldn't have to persist their mappings.
> > > 
> > > Instead on each container startup we'd look for a new UID range, and
> > > release it entirely when the container shuts down. The bind mount with
> > > UID shift would then shift the UIDs up, the userns stuff would shift
> > > it down from inside the container again.
> > > 
> > > Of course, this all depends on whether the kernel will get an
> > > extension to apply uid shifts to bind mounts. I hear they want to
> > > provide this, but let's see.
> > 
> > I would dearly love to see that happen. Having to recursively change

It'd definitely be useful (though not without issues).

> > the UID/GID on entire filesystem sub-trees given to containers with
> > userns is a real unpleasant thing to have to deal with. I'd not want

Of course you would *not* want to take a stock rootfs where uid == 0
and shift that into the container, as that would give root in the
container a chance to write root-owned files on the host to leverage
later in a convoluted attack :)  We might want to come up with a
containers consensus that container rootfs's are always shipped with
uid range 0-65535 -> 100000-165535.  That still leaves a chance for
container A (mapped to 200000-265535) to write a valid setuid-root
binary for container B (mapped to 300000-365535), which isn't possible
otherwise.  But that's better than doing so for host-root.
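
To make the arithmetic concrete, here is a minimal sketch in Python. It is
purely illustrative: shift_uid and the *_BASE names are made up for this
example and are not any kernel or systemd interface. It assumes a shared
rootfs stored on disk at 100000-165535, with a per-container shift applied
at mount time.

```python
RANGE_SIZE = 65536  # uids 0-65535 per container, as in the ranges above


def shift_uid(uid, src_base, dst_base, size=RANGE_SIZE):
    """Translate uid from [src_base, src_base+size) into [dst_base, dst_base+size)."""
    offset = uid - src_base
    if not 0 <= offset < size:
        raise ValueError(f"uid {uid} is outside the range starting at {src_base}")
    return dst_base + offset


# Hypothetical bases, taken from the example ranges in this mail.
DISK_BASE = 100000  # how the shared rootfs is stored on disk
A_BASE = 200000     # host range assigned to container A
B_BASE = 300000     # host range assigned to container B

# root inside container A is this uid from the host's point of view
host_uid_a = shift_uid(0, 0, A_BASE)                  # 200000

# a file it creates is stored on disk with the mount shift undone
disk_uid = shift_uid(host_uid_a, A_BASE, DISK_BASE)   # 100000, never host uid 0

# the same file seen through container B's shifted mount and userns
host_uid_b = shift_uid(disk_uid, DISK_BASE, B_BASE)   # 300000
container_uid_b = shift_uid(host_uid_b, B_BASE, 0)    # 0, i.e. root in B
```

So the file never shows up as host uid 0, but it is root-owned from
container B's point of view, which is the cross-container setuid case
described above.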

> > the filesystem UID shift to only apply to bind mounts though. It is
> > not uncommon to use a disk image[1] for a container's filesystem, so
> > being able to request a UID shift on *any* filesystem mount is pretty
> > desirable, rather than having to mount the image and then bind mount
> > it onto itself just to apply the UID shift.
> 
> Well, you can always change the bind mount flags without creating a
> new bind mount with MS_BIND|MS_REMOUNT.
> 
> > [1] Using a separate disk image per container means a container can't
> >     DOS other containers by exhausting inodes for example with $millions
> >     of small files.
> 
> Indeed. I'd claim that without such a concept of mount point uid
> shifting the whole userns story is not very useful IRL...

I had always thought this would eventually be done using a stackable
filesystem, but doing it at bind mount time would be neat too, and
less objectionable to the kernel folks.  (Though overlayfs is in now,
so <shrug>)

I'm actually quite surprised no one has sat down and written a
stackable uid-shifting fs yet.

-serge