[systemd-devel] [HEADS-UP] systemd and Storage Daemons for the Root File System

Wed Jan 11 07:04:58 PST 2012

On Wednesday 2012-01-11 15:26, Lennart Poettering wrote:

>On Wed, 11.01.12 14:44, Jan Engelhardt (jengelh at medozas.de) wrote:
>
>> >> Forcing the use of @ introduces a policy, which should preferably not be 
>> >> done. Since programs started from the initrd obviously should be having 
>> >> a /proc/*/{cwd,exe} symlinks pointing to the initramfs vfsmount.
>> >
>> >They are in a different namespace, so that wouldn't work.
>> 
>> Namespace as in clone(2)'s CLONE_NEWNS?
>
>No, my expression was a bit unclean there.
>
>What I meant is that the path in argv[0] and similar stops making sense
>after the switch to the root fs, since we did a MS_MOVE there, which
>invalidates all old paths...

Yeah, I was not talking about argv[0], since that is user-changeable 
anyhow. My words were about /proc/self/exe, which is a link to the 
absolute path - and may not be the same as argv[0].

>But yeah, there's no new vfs namespace opened, just some major changes in
>what means what in the original namespace.

Since everybody seems to have a brainknot right now, let's attempt to 
shed some more light.

A mount namespace is a set of vfsmounts. The vfsmounts in your current 
mount namespace can be obtained through /proc/self/mounts or others like 
mountinfo. CLONE_NEWNS in a clone(2) or unshare(2) call creates a new 
namespace, inheriting all vfsmounts and their positions, and these are 
the only ways (I know) to create one. chroot does _NOT_ create a new mount namespace, 
because a vfsmount created within the chroot can be unmounted from a 
different process not inside the chroot jail.
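
For anyone who wants to check whether two processes share a mount 
namespace, here is a minimal sketch. It assumes a kernel that exposes 
/proc/PID/ns/mnt (Linux 3.8 and later, so newer than the kernels this 
thread is about). The function name same_mount_ns and its optional 
third argument are made up for illustration; the third argument only 
exists so the function can be pointed at a directory other than /proc.

```shell
# Hypothetical helper: succeed if PIDs $1 and $2 are in the same mount
# namespace. $3 is the proc directory to look in (default: /proc).
same_mount_ns() {
    proc_dir=${3:-/proc}
    # Each ns/mnt link reads like "mnt:[4026531840]"; equal text means
    # the same namespace. Return 2 if either link cannot be read.
    ns_a=$(readlink "$proc_dir/$1/ns/mnt") || return 2
    ns_b=$(readlink "$proc_dir/$2/ns/mnt") || return 2
    [ "$ns_a" = "$ns_b" ]
}
```

Something like "same_mount_ns 1 $$ && echo shared" would then compare 
the current shell against PID 1, given sufficient privileges to read 
the link of PID 1.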

Since the system initialization procedures in dracut and systemd don't 
issue CLONE_NEWNS, as far as I can tell, we can completely ignore 
namespaces for now.

Now, the kernel has a rootfs-type vfsmount initially mounted on /. This 
is where your initramfs cpio is extracted to. You can see it as being 
the first entry in /proc/self/mounts ("rootfs / rootfs rw 0 0").
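
A small sketch for checking that first entry from a script. The helper 
name first_mount_fstype is invented for this example, and the optional 
file argument is only there so that it can be tried on a saved copy of 
the mount table instead of the live one.

```shell
# Hypothetical helper (not part of any tool): print the filesystem type
# of the first vfsmount listed in a mounts-format file.
first_mount_fstype() {
    # Fields per line: device mountpoint fstype options dump pass
    awk 'NR == 1 { print $3 }' "${1:-/proc/self/mounts}"
}
```

A caller could then test the result, for instance with 
[ "$(first_mount_fstype)" = rootfs ].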

Since commands tell more than a thousand words:

  /bin/sleep 99999 &
  pid=$!
  mount /dev/sda3 /mnt
  mkdir /mnt/var/run/rootfs
  cd /mnt
  pivot_root /mnt /mnt/var/run/rootfs
  readlink -f /proc/$pid/exe

  => should now yield /var/run/rootfs/bin/sleep

Therefore you can detect which programs were started inside the rootfs 
vfsmount. That information can then influence killing decisions as 
needed.
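
As a rough sketch of how that detection could look, assuming the old 
root ends up at /var/run/rootfs as in the commands above: the function 
names started_in_initramfs and list_initramfs_pids are hypothetical, 
and the proc directory is a parameter purely so the scan can be tried 
against a test directory.

```shell
# Hypothetical helper: succeed if path $1 lies below the old root $2.
started_in_initramfs() {
    case "$1" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Hypothetical helper: print the PIDs whose exe link points below the
# old root. $1 = proc directory (default /proc),
# $2 = where the old root is mounted (default /var/run/rootfs).
list_initramfs_pids() {
    proc_dir=${1:-/proc}
    old_root=${2:-/var/run/rootfs}
    for link in "$proc_dir"/[0-9]*/exe; do
        # Kernel threads, and processes we are not allowed to inspect,
        # have no readable exe link; skip those.
        target=$(readlink "$link" 2>/dev/null) || continue
        if started_in_initramfs "$target" "$old_root"; then
            pid=${link%/exe}
            echo "${pid##*/}"
        fi
    done
}
```

Whatever does the killing could then spare (or single out) the PIDs 
printed by list_initramfs_pids.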

Now, Kay Sievers claims (on IRC) pivot_root is "10 years ago stuff" and 
points to util-linux's switch_root for how things are supposed to be 
done today. But, as we look at 
http://git.kernel.org/?p=utils/util-linux/util-linux.git;a=blob;f=sys-utils/switch_root.c;hb=HEAD#l150 
what can really be seen there is that the new root (/dev/sda3 in my 
previous commands example) is just mounted atop the rootfs-type 
vfsmount, thereby concealing it.
(That is not a replacement for what pivot_root does, really.)
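
One way to see the stacking, sketched under the assumption that the 
mount table still lists the rootfs entry as described earlier: after a 
switch_root one would expect more than one entry with / as its mount 
point. The helper name mounts_on is made up, and the optional second 
argument lets it read a saved copy of the table.

```shell
# Hypothetical helper: print "device fstype" for every vfsmount whose
# mount point is exactly $1, in the order the table lists them.
# $2 is the mounts-format file to read (default: /proc/self/mounts).
mounts_on() {
    awk -v mp="$1" '$2 == mp { print $1, $3 }' "${2:-/proc/self/mounts}"
}
```

Running "mounts_on /" and getting more than one line back would 
indicate that something is mounted atop the original root.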

Of course, if you conceal the rootfs-type vfsmount, there is no way that 
the proc trick is going to work -- which is why I proposed using 
pivot_root instead of {MS_MOVE + chroot} and *keeping* the rootfs 
vfsmount around, in a visible fashion.

Similarly, when systemd wants to return to the initramfs, it can just 
pivot_root again, this time by

  cd /var/run/rootfs
  pivot_root /var/run/rootfs /var/run/rootfs/mnt

(or the C equivalent using pivot_root(2) of course.)