[systemd-devel] systemd-tmpfiles subvolume handling vs. changing default btrfs root

Ignaz Forster iforster at suse.de
Fri Jun 29 19:04:18 UTC 2018


Reordered the quotes below for better reading flow.

Am 28.06.2018 um 10:52 schrieb Lennart Poettering:
>>> But quite frankly I don't grok the problem at hand, i.e. what you are
>>> trying to do, even.
>>
>> Was this explanation any better?
>
> Not really still, what I don't grok what precisely a "system snapshot"
> in suse terms is actually supposed to entail. Is it supposed to
> contain only the vendor RPMs, i.e. only /usr?

That's the general idea, yes.*

Everything that contains variable or user data (i.e. data which is not
supposed to be rolled back, such as databases or files created by the
user) is put on its own subvolume or partition.

For reference, here is what this looks like on openSUSE Leap 15 again:
ID     parent top lvl path
--     ------ ------- ----
257    5      5       <FS_TREE>/@
258    257    257     <FS_TREE>/@/var
259    257    257     <FS_TREE>/@/usr/local
260    257    257     <FS_TREE>/@/tmp
261    257    257     <FS_TREE>/@/srv
262    257    257     <FS_TREE>/@/root
263    257    257     <FS_TREE>/@/opt
264    257    257     <FS_TREE>/@/home
265    257    257     <FS_TREE>/@/boot/grub2/x86_64-efi
266    257    257     <FS_TREE>/@/boot/grub2/i386-pc
267    257    257     <FS_TREE>/@/.snapshots
411    267    267     <FS_TREE>/@/.snapshots/138/snapshot
412    267    267     <FS_TREE>/@/.snapshots/139/snapshot


*) Some packages will still use /bin, /lib and the like, and those will 
be part of the snapshot; on the other hand distribution RPMs may also 
contain files or directories in e.g. /var, which will not be part of the 
snapshot. Because of that I'd prefer the term "static / read-only / 
unmodifiable part of the root file system" instead of "vendor RPMs".

> or everything except
> /home, /srv, /var, /tmp?

Everything except the directories listed above, because those contain
variable data which one usually doesn't want to reset just because e.g.
a new kernel doesn't boot.
Of course that won't prevent users from creating their own snapshots of
these subvolumes.
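
To make the split concrete, here is a minimal sketch that checks whether a
path would be part of the root snapshot under the Leap 15 layout listed
above. The helper name in_root_snapshot is made up for illustration, and the
check only looks at path prefixes; it does not query btrfs.

```python
from pathlib import PurePosixPath

# Nested subvolumes from the listing above, relative to <FS_TREE>/@.
NESTED_SUBVOLUMES = [
    "/var", "/usr/local", "/tmp", "/srv", "/root", "/opt", "/home",
    "/boot/grub2/x86_64-efi", "/boot/grub2/i386-pc", "/.snapshots",
]


def in_root_snapshot(path: str) -> bool:
    """Return True if `path` is stored in the root subvolume itself,
    i.e. it would be rolled back together with the system snapshot."""
    p = PurePosixPath(path)
    for sub in NESTED_SUBVOLUMES:
        s = PurePosixPath(sub)
        # The path is the subvolume itself or lies somewhere below it.
        if p == s or s in p.parents:
            return False
    return True


for candidate in ("/usr/bin/ls", "/usr/local/bin/tool", "/var/lib/machines"):
    print(candidate, in_root_snapshot(candidate))
```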

>>> systemd will never create disassociated subvolumes for you.
>>
>> That's the problem - it will create subvolumes which will just disappear
>> from the system when switching to the next snapshot.
>
> Well, no, if snapshots are done recursively they wouldn't, they would
> be switched at the same time.

I don't think it's relevant for this discussion, but you have repeatedly
mentioned recursive snapshots now; as far as I'm aware, btrfs is not
capable of doing that. I found a patchset at
https://www.spinics.net/lists/linux-btrfs/msg29205.html, but it seems
the parts relevant for snapshot creation weren't merged upstream.

So how are those recursive btrfs snapshots supposed to work?
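
To illustrate the behaviour I am describing, here is a toy model (not real
btrfs code) of a non-recursive snapshot: the files of the source subvolume
are carried over, nested subvolumes are not. The Subvolume class and the
snapshot() function are invented for this sketch; as far as I know, real
btrfs shows a nested subvolume as an empty directory in the snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Subvolume:
    files: set = field(default_factory=set)
    # Nested subvolumes, keyed by the path they are attached at.
    children: dict = field(default_factory=dict)


def snapshot(src: Subvolume) -> Subvolume:
    """Model a non-recursive snapshot: copy the files, drop nested subvolumes."""
    return Subvolume(files=set(src.files), children={})


root = Subvolume(files={"/usr/bin/systemd-nspawn"})
# A subvolume created implicitly below the current root subvolume.
root.children["/var/lib/machines"] = Subvolume(
    files={"/var/lib/machines/container.raw"}
)

new_root = snapshot(root)
print(sorted(root.children))      # the old root still has the nested subvolume
print(sorted(new_root.children))  # the new root does not
```

After switching the default subvolume to new_root, the data below
/var/lib/machines is still on disk, but it is no longer reachable from the
running system.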

> tmpfiles won't create any subvolumes for you — except if they are
> missing. tmpfiles can't guess the complex mappings you applied to your
> tree, it can't know that you don't want to allow recursive snapshots,
> but place them all in the same dir and bind mount them. Also, if I
> understand correctly the way suse sets this up always *requires*
> additions to fstab for any subvol created, which is clearly out of
> focus for tmpfiles.

I agree that it's next to impossible to programmatically find out what a
user intended to do with a specific layout.
However, in my opinion a working, though maybe not optimal, configuration
would be preferable to one which is known to break in several cases
(independent of the distribution).

Instead of adding fstab entries (which I am also uneasy about), creating
a mount unit may be an alternative. But yes, something would have to be
done to mount those subvolumes on boot.
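
As a rough sketch of the mount unit idea, the following generates the text
of such a unit for one subvolume. The device path /dev/disk/by-label/ROOT is
a placeholder, the two helper functions are hypothetical, and the unit name
escaping is deliberately simplified: it only accepts paths that need no
special escaping and rejects everything else (a real tool would use
systemd-escape --path).

```python
import string

# Characters that survive unit-name escaping unchanged (besides "/").
_SAFE = set(string.ascii_letters + string.digits + "_./")


def mount_unit_name(where: str) -> str:
    """Derive the unit file name from the mount point (simple paths only)."""
    trimmed = where.strip("/")
    if (not trimmed or trimmed.startswith(".") or "//" in trimmed
            or not set(trimmed) <= _SAFE):
        raise ValueError(f"path needs full systemd escaping: {where!r}")
    return trimmed.replace("/", "-") + ".mount"


def render_mount_unit(device: str, where: str, subvol: str) -> str:
    """Render a mount unit that mounts `subvol` of `device` at `where`."""
    return "\n".join([
        "[Unit]",
        f"Description=btrfs subvolume {subvol}",
        "",
        "[Mount]",
        f"What={device}",
        f"Where={where}",
        "Type=btrfs",
        f"Options=subvol={subvol}",
        "",
        "[Install]",
        "WantedBy=local-fs.target",
        "",
    ])


name = mount_unit_name("/var/lib/machines")
unit = render_mount_unit("/dev/disk/by-label/ROOT", "/var/lib/machines",
                         "@/var/lib/machines")
print(name)
print(unit)
```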

> Also, tmpfiles won't actually create any subvols below /usr (unless a
> user dropped something in to do that on its own), it will only do so
> in the root dir for precisely /var, /tmp, /home and /srv. All others
> are created below /var. Which means you rule of "don't create subvols
> below system directories" isn't actually touched, because the
> read-only OS is monopolized in /usr anyway... Or maybe I am still not
> getting what you are trying to say?

The rule would be "don't create subvols below snapshots", and the
read-only OS is not exactly monopolized in /usr either (not only because
of /bin, /lib etc., but also because of /boot - see the last paragraph of
this mail), but apart from that, that nails it.

The issue was originally discovered when upgrading systemd on an older
openSUSE machine which did not have a unified /var subvolume, so
/var/lib/machines got attached to the root subvolume.
This may happen again in the future for us, but as I said, we are not the
only ones using this mechanism. Given the default Fedora and Ubuntu
btrfs layouts, it's even more likely to happen if anybody is using
pattern 3 there. Apart from that, I'd prefer systemd-tmpfiles to work
even if a user threw in something unexpected.

I'm wondering if just refusing to create a subvolume on a snapshot would 
be another option... That way the problem would be given back to the 
user or distribution.
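
Sketched as a decision function, the idea would look roughly like this. It
is purely a hypothetical policy, not what systemd-tmpfiles does today; the
Action enum and decide() are made up, and how to reliably detect that the
parent is a snapshot is left open and passed in as a flag.

```python
from enum import Enum


class Action(Enum):
    SKIP = "path already exists, nothing to do"
    CREATE_SUBVOLUME = "create a subvolume"
    REFUSE = "refuse and leave the decision to the user or distribution"


def decide(path_exists: bool, parent_is_snapshot: bool) -> Action:
    """Decide what to do for a subvolume entry under the proposed rule."""
    if path_exists:
        # Subvolumes are only ever created when the path is missing.
        return Action.SKIP
    if parent_is_snapshot:
        # A subvolume created here would vanish on the next snapshot switch.
        return Action.REFUSE
    return Action.CREATE_SUBVOLUME


print(decide(path_exists=False, parent_is_snapshot=True).value)
```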

>>> The assumption systemd-tmpfiles makes is always that the subvolumes
>>> it implicitly creates for you if they are missing are associated
>>> with the subvolume they are created below, and that this means they
>>> are snapshotted, removed and otherwise managed along with them.
>>
>> Keeping this logic more or less assumes that snapshots will always be used
>> as static backups and pattern 3 from above must not be used.
> 
> I don't see that at all. I mean, this all depends how you want to
> associate /var with /. my assumption is that they belong together, but
> i figure that's not what you have in mind? you want to keep using the
> same /var even though you switch back and forth to different /?

Exactly - viewing them as separate entities after installation has
proven to work very reliably for us and is documented accordingly.
As I said above, the reasoning behind this is that you usually don't want
to lose e.g. all accumulated database changes just because you have to
revert the system state due to a failed package update.

> i am not sure if follow fully, but i think the model should be the
> other way round: keep the root file system in one subvolume, and keep
> /usr completely separate from that, and only combine the two through
> bind mounts when you want to go for one specific version. In that
> mode, all subvolumes systemd generates would be children of the root
> subvolume, as they should be, but /usr would be separate.

Currently the snapshot contains everything which is relevant for a 
complete rollback of the system including /boot and /.snapshots 
(containing snapper metadata). Splitting this up into three (or more) 
separate subvolumes would be a major architectural change. I'll think 
about this over the weekend, but I don't think I like the idea - 
synchronizing those volumes will probably be a nightmare.

Ignaz
-- 
Ignaz Forster <iforster at suse.com>
Research Engineer
SUSE Linux GmbH, Maxfeldstr. 5, D-90409 Nürnberg
Tel: +49-911-74053-281;  https://www.suse.com/
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard,
Graham Norton, HRB 21284 (AG Nürnberg)

