[systemd-devel] [HEADS-UP] systemd and Storage Daemons for the Root File System
Jan Engelhardt
jengelh at medozas.de
Wed Jan 11 07:04:58 PST 2012
On Wednesday 2012-01-11 15:26, Lennart Poettering wrote:
>On Wed, 11.01.12 14:44, Jan Engelhardt (jengelh at medozas.de) wrote:
>
>> >> Forcing the use of @ introduces a policy, which should preferably not be
>> >> done. Since programs started from the initrd obviously should be having
>> >> a /proc/*/{cwd,exe} symlinks pointing to the initramfs vfsmount.
>> >
>> >They are in a different namespace, so that wouldn't work.
>>
>> Namespace as in clone(2)'s CLONE_NEWNS?
>
>No, my expression was a bit unclean there.
>
>What I meant is that the path in argv[0] and similar stops making sense
>after the switch to the root fs, since we did a MS_MOVE there, which
>invalidates all old paths...
Yeah, I was not talking about argv[0], since that is user-changeable
anyhow. My point was about /proc/self/exe, which is a link to the
absolute path of the executable - and may not be the same as argv[0].
>But yeah, there's no new vfs namespace opened, just some major changes in
>what means what in the original namespace.
Since everybody seems to have a brainknot right now, let's attempt to
shed some more light.
A mount namespace is a set of vfsmounts. The vfsmounts in your current
mount namespace can be obtained through /proc/self/mounts or others like
mountinfo. CLONE_NEWNS in a clone(2) call creates a new namespace,
inheriting all vfsmounts and their positions, and this is the only way
(that I know of) to create one. chroot does _NOT_ create a new mount namespace,
because a vfsmount created within the chroot can be unmounted from a
different process not inside the chroot jail.
Since the system initialization procedures in dracut and systemd don't
issue CLONE_NEWNS, as far as I gather, we can completely ignore
namespaces for now.
Now, the kernel has a rootfs-type vfsmount initially mounted on /. This
is where your initramfs cpio is extracted to. You can see it as being
the first entry in /proc/self/mounts ("rootfs / rootfs rw 0 0").
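A minimal sketch of checking that first entry from a script. The function
name first_mount_fstype is made up for illustration, and the optional file
argument exists only so the check can be tried against a saved copy of the
mounts table; whether rootfs is actually listed depends on the kernel in use.

```shell
# first_mount_fstype [FILE]
# Print the filesystem type of the first entry in a mounts-format file.
# Fields per line: device mountpoint fstype options dump pass.
# FILE defaults to /proc/self/mounts (hypothetical helper, not part of any tool).
first_mount_fstype() {
    read -r _dev _mnt _fstype _rest < "${1:-/proc/self/mounts}"
    printf '%s\n' "$_fstype"
}
```

Calling first_mount_fstype with no argument inspects the mount table of the
calling shell.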
Since commands tell more than a thousand words:
    /bin/sleep 99999 &
    pid=$!
    mount /dev/sda3 /mnt
    mkdir /mnt/var/run/rootfs
    cd /mnt
    pivot_root /mnt /mnt/var/run/rootfs
    readlink -f /proc/$pid/exe
=> should now yield /var/run/rootfs/bin/sleep
Therefore you can detect which programs were started inside the rootfs
vfsmount. That information can then influence killing decisions as
needed.
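The detection described above could be sketched as a loop over the exe
links. The function name pids_started_in and its optional second argument
(an alternative proc directory, handy for trying it out on a fake tree) are
assumptions for illustration; /var/run/rootfs is the mount point used in the
commands above.

```shell
# pids_started_in PREFIX [PROCDIR]
# Print "PID EXE" for every process whose exe link points below PREFIX.
# PROCDIR defaults to /proc (hypothetical helper, not part of systemd or dracut).
pids_started_in() {
    prefix=$1
    procdir=${2:-/proc}
    for exe in "$procdir"/[0-9]*/exe; do
        # Kernel threads and processes we may not inspect have no readable link.
        target=$(readlink "$exe" 2>/dev/null) || continue
        case $target in
            "$prefix"/*)
                pid=${exe%/exe}
                printf '%s %s\n' "${pid##*/}" "$target"
                ;;
        esac
    done
}
```

Running pids_started_in /var/run/rootfs after the pivot_root lists the
candidates; what to do with them (spare or kill) is left to the caller.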
Now, Kay Sievers claims (on IRC) that pivot_root is "10 years ago stuff"
and points to util-linux's switch_root for how things are supposed to be
done today. But, as we look at
http://git.kernel.org/?p=utils/util-linux/util-linux.git;a=blob;f=sys-utils/switch_root.c;hb=HEAD#l150
what can really be seen there is that the new root (/dev/sda3 in my
command example above) is just mounted atop the rootfs-type
vfsmount, thereby concealing it.
(That is not a replacement for what pivot_root does, really.)
Of course, if you conceal the rootfs-type vfsmount, there is no way that
the proc trick is going to work -- which is why I proposed using
pivot_root instead of {MS_MOVE + chroot} and *keeping* the rootfs
vfsmount around, in a visible fashion.
Similarly, when systemd wants to return to the initramfs, it can just
pivot_root again, this time by
    cd /var/run/rootfs
    pivot_root /var/run/rootfs /var/run/rootfs/mnt
(or the C equivalent using pivot_root(2) of course.)