[systemd-devel] systemd and nested Btrfs subvolumes

Chris Murphy lists at colorremedies.com
Thu Mar 19 18:27:33 PDT 2015

Short version:
  Instead of machinectl clone using Btrfs snapshots, or even needing
to store things in a var/lib/machines Btrfs subvolume, would cp -a
--reflink meet the requirement that cloning be optimized on Btrfs?

Why? Nested subvolumes are confusing, and they are excluded from
snapshots. Subvolume B inside Subvolume A will not be snapshotted or
rolled back if I snapshot Subvolume A and subsequently roll back to
that snapshot of A. Is this the intended workflow?

Long version comparing two different Btrfs layout paradigms:
  New in systemd-219 is the creation of a subvolume for storing
containers (machines). This is what things look like on Fedora 22
right now.

# btrfs sub list -a /
ID 257 gen 1705 top level 5 path <FS_TREE>/root
ID 258 gen 1662 top level 5 path <FS_TREE>/home
ID 259 gen 1681 top level 5 path <FS_TREE>/boot
ID 262 gen 1705 top level 257 path root/var/lib/machines

# cat /etc/fstab
UUID=<uuid>  /                       btrfs   subvol=root     0 0
UUID=<uuid>  /boot                   btrfs   subvol=boot     0 0
UUID=<uuid>  /home                   btrfs   subvol=home     0 0

And the clone entry in 'man machinectl' says:
       clone NAME NAME
           Clones a container or VM image. The arguments specify the name of
           the image to clone and the name of the newly cloned image. Note
           that plain directory container images are cloned into subvolume
           images with this command. Note that cloning a container or VM
           image is optimized for btrfs file systems, and might not be
           efficient on others, due to file system limitations.

The problem is that nested subvolumes like var/lib/machines become
tricky. If I snapshot the root subvolume, the snapshot doesn't contain
var/lib/machines. If I roll back to root.n-1 and boot that, I no longer
have var/lib/machines at all because it's now in an old unused
subvolume and not in the currently mounted path.
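
To see which subvolumes a snapshot of a given parent would leave behind, the listing above can be filtered on its "top level" column. This is only a rough sketch: nested_in is a made-up helper name, it assumes the 'btrfs sub list' output format shown earlier, it reports direct children only, and it does not handle paths containing spaces.

```shell
# nested_in PARENT_ID
# Reads `btrfs subvolume list` output on stdin and prints the path of every
# subvolume whose "top level" (parent subvolume ID) equals PARENT_ID.
# (Hypothetical helper; relies on the column layout shown in this mail.)
nested_in() {
    awk -v parent="$1" \
        '$5 == "top" && $6 == "level" && $7 == parent { print $9 }'
}

# On a live system (needs root), something like:
#   btrfs subvolume list -a / | nested_in 257
```

With the Fedora listing above, filtering on ID 257 (root) would report root/var/lib/machines, which is exactly the subvolume a snapshot of root does not include.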

Can clone use cp -a --reflink instead of snapshots? Can subvolumes be
entirely avoided?
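
For concreteness, a reflink-based clone of a plain directory image might look roughly like the sketch below. It is not what machinectl does; clone_tree and the example paths are made up, and it assumes GNU cp. It uses --reflink=auto rather than the bare --reflink mentioned above so that it falls back to an ordinary copy on filesystems without reflink support instead of failing.

```shell
# clone_tree SRC DST
# Copies the directory tree SRC to a new path DST, preserving ownership,
# modes and timestamps (-a) and sharing data extents where the filesystem
# supports reflinks. (Hypothetical helper, not part of systemd.)
clone_tree() {
    src=$1
    dst=$2
    # Refuse to reuse an existing destination: cp would otherwise copy
    # SRC *into* DST rather than creating DST as the clone.
    if [ -e "$dst" ]; then
        echo "clone_tree: $dst already exists" >&2
        return 1
    fi
    # --reflink=auto: reflink when possible, plain copy otherwise.
    # A bare --reflink means "always" and errors out without support.
    cp -a --reflink=auto -- "$src" "$dst"
}

# Example with made-up image names:
#   clone_tree /var/lib/machines/f22 /var/lib/machines/f22-test
```

Since this works on ordinary directories, the result is visible to a snapshot of whichever subvolume contains it, which is the property being asked about here.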

Right now there is a split in Btrfs layout paradigms. The Fedora/RH
method is to create non-nested subvolumes in the (permanent) top level
subvolume (ID 5 a.k.a. ID 0), and then use fstab to mount those
subvolumes to the proper mount points in the conventional FHS we're
used to. The top level of the file system is never actually mounted by
default on Fedora. This paradigm suggests a 'machines' subvolume
should go in the top level, and an fstab entry to mount it via:

UUID=<uuid>  /var/lib/machines                   btrfs   subvol=machines     0 0

Now, the other paradigm is from openSUSE. The top level (subvol ID 5)
is populated like any other filesystem would be, but many of the usual
directories are instead subvolumes. There are 14 subvolumes to be
exact, and they're subvolumes for the purpose of excluding them from
the snapshotting and rolling back policies of root fs. In this
paradigm, root is the only thing being snapshot and rolled back, and
to avoid rolling back things like the journal and logs, they use
subvolumes because nested subvolumes do not get snapshot - there's no
recursive snapshotting of nested subvolumes on Btrfs.

# cat /etc/fstab
UUID=<uuid> swap swap defaults 0 0
UUID=<uuid> / btrfs defaults 0 0
UUID=<uuid> /boot/grub2/i386-pc btrfs subvol=boot/grub2/i386-pc 0 0
UUID=<uuid> /boot/grub2/x86_64-efi btrfs subvol=boot/grub2/x86_64-efi 0 0
UUID=<uuid> /opt btrfs subvol=opt 0 0
UUID=<uuid> /srv btrfs subvol=srv 0 0
UUID=<uuid> /tmp btrfs subvol=tmp 0 0
UUID=<uuid> /usr/local btrfs subvol=usr/local 0 0
UUID=<uuid> /var/crash btrfs subvol=var/crash 0 0
UUID=<uuid> /var/lib/mailman btrfs subvol=var/lib/mailman 0 0
UUID=<uuid> /var/lib/named btrfs subvol=var/lib/named 0 0
UUID=<uuid> /var/lib/pgsql btrfs subvol=var/lib/pgsql 0 0
UUID=<uuid> /var/log btrfs subvol=var/log 0 0
UUID=<uuid> /var/opt btrfs subvol=var/opt 0 0
UUID=<uuid> /var/spool btrfs subvol=var/spool 0 0
UUID=<uuid> /var/tmp btrfs subvol=var/tmp 0 0
UUID=<uuid> /home xfs defaults 1 2
UUID=<uuid> /.snapshots btrfs subvol=.snapshots 0 0

If today, on openSUSE, a hypothetical 'machines' subvolume were
created, it would most likely need to go in a /.subvolume subvolume to
keep it out of the expected FHS listing at / and this is also
consistent with the /.snapshots subvolume in which they keep

Therefore, it's a somewhat nested subvolume strategy, with a heavy
duty fstab to reassemble the thing, and quite substantial grub2
patches so that snapshots are visible and can be rolled back from the
boot menu.

So back to cp -a --reflink as a workaround for both paradigms? Does
this method of cloning meet the requirements for systemd containers?
If so, it works on both the RH/Fedora and the openSUSE layouts.
Meaning the var/lib/machines subvolume isn't needed, just use cp -a
--reflink on either directories or files, and it's almost as fast as a
btrfs snapshot.


Chris Murphy
