[systemd-devel] systemd-tmpfiles subvolume handling vs. changing default btrfs root
Ignaz Forster
iforster at suse.de
Fri Jun 29 19:04:18 UTC 2018
Reordered the quotes below for better reading flow.
Am 28.06.2018 um 10:52 schrieb Lennart Poettering:
>>> But quite frankly I don't grok the problem at hand, i.e. what you are
>>> trying to do, even.
>>
>> Was this explanation any better?
>
> Not really still, what I don't grok is what precisely a "system snapshot"
> in suse terms is actually supposed to entail. Is it supposed to
> contain only the vendor RPMs, i.e. only /usr?
That's the general idea, yes.*
Everything that contains variable or user data (i.e. data that is not
supposed to be rolled back, such as databases or files created by the user)
is put onto its own subvolume or partition.
For reference, here is what this looks like on openSUSE Leap 15 again:
ID   parent  top lvl  path
--   ------  -------  ----
257  5       5        <FS_TREE>/@
258  257     257      <FS_TREE>/@/var
259  257     257      <FS_TREE>/@/usr/local
260  257     257      <FS_TREE>/@/tmp
261  257     257      <FS_TREE>/@/srv
262  257     257      <FS_TREE>/@/root
263  257     257      <FS_TREE>/@/opt
264  257     257      <FS_TREE>/@/home
265  257     257      <FS_TREE>/@/boot/grub2/x86_64-efi
266  257     257      <FS_TREE>/@/boot/grub2/i386-pc
267  257     257      <FS_TREE>/@/.snapshots
411  267     267      <FS_TREE>/@/.snapshots/138/snapshot
412  267     267      <FS_TREE>/@/.snapshots/139/snapshot
*) Some packages will still use /bin, /lib and the like, and those will
be part of the snapshot; on the other hand distribution RPMs may also
contain files or directories in e.g. /var, which will not be part of the
snapshot. Because of that I'd prefer the term "static / read-only /
unmodifiable part of the root file system" instead of "vendor RPMs".
> or everything except
> /home, /srv, /var, /tmp?
Everything except the directories listed above, because those contain
variable data which one usually doesn't want to reset just because e.g.
a new kernel doesn't boot.
That won't prevent the user from creating his own snapshots of these
subvolumes of course.
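To make the split concrete, here is a minimal Python sketch (not part of any existing tool) that decides whether a path is rolled back together with the root snapshot or lives in one of the separate subvolumes. The list of mount points is copied from the Leap 15 listing above; the function names are made up for illustration.

```python
from pathlib import PurePosixPath

# Mount points of the separate subvolumes, taken from the listing above
# (paths as seen in the running system, i.e. relative to "@").
SEPARATE_SUBVOLUMES = [
    "/var", "/usr/local", "/tmp", "/srv", "/root", "/opt", "/home",
    "/boot/grub2/x86_64-efi", "/boot/grub2/i386-pc", "/.snapshots",
]


def owning_subvolume(path, subvolumes=SEPARATE_SUBVOLUMES):
    """Return the separate subvolume that contains `path`, or None if the
    path is stored in the root snapshot itself (longest prefix wins)."""
    target = PurePosixPath(path)
    best = None
    for sub in subvolumes:
        sub_path = PurePosixPath(sub)
        if target == sub_path or sub_path in target.parents:
            if best is None or len(sub_path.parts) > len(best.parts):
                best = sub_path
    return str(best) if best is not None else None


def is_rolled_back(path):
    """True if `path` belongs to the snapshot and is reset on rollback."""
    return owning_subvolume(path) is None
```

With this layout, is_rolled_back("/usr/bin/ls") and is_rolled_back("/boot/vmlinuz") are true, while /usr/local/bin/foo and /var/lib/machines survive a rollback.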
>>> systemd will never create disassociated subvolumes for you.
>>
>> That's the problem - it will create subvolumes which will just disappear
>> from the system when switching to the next snapshot.
>
> Well, no, if snapshots are done recursively they wouldn't, they would
> be switched at the same time.
I don't think this is relevant for the discussion, but you have
repeatedly mentioned recursive snapshots now; as far as I'm aware, btrfs
is not capable of doing that. I found a patchset at
https://www.spinics.net/lists/linux-btrfs/msg29205.html, but it seems
the relevant parts for snapshot creation were never merged upstream.
So how are those recursive btrfs snapshots supposed to work?
> tmpfiles won't create any subvolumes for you — except if they are
> missing. tmpfiles can't guess the complex mappings you applied to your
> tree, it can't know that you don't want to allow recursive snapshots,
> but place them all in the same dir and bind mount them. Also, if I
> understand correctly the way suse sets this up always *requires*
> additions to fstab for any subvol created, which is clearly out of
> focus for tmpfiles.
I agree that it's next to impossible to programmatically find out what a
user intended to do with a specific layout.
However, in my opinion it would be preferable to create at least a
working, though maybe not optimal, configuration rather than a
configuration which is known to break in several cases (independent of
the distribution).
Instead of adding fstab entries (which I am also uneasy about), an
alternative may be to create a mount unit. But yes, something
would have to be done to mount those subvolumes on boot.
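As a rough illustration of the mount unit idea, the following Python sketch renders the text of such a unit for a given subvolume. It is only a sketch under assumptions: the device path and the subvolume name in the usage note are hypothetical, and mount_unit_name() is a simplified stand-in for "systemd-escape --path --suffix=mount" that only handles "/" and "-" in plain ASCII paths.

```python
def mount_unit_name(where):
    """Derive the unit file name for a mount point.

    Simplified escaping (assumption): "/" becomes "-" and a literal "-"
    becomes "\\x2d". The authoritative tool is systemd-escape.
    """
    parts = [p for p in where.strip("/").split("/") if p]
    if not parts:
        return "-.mount"  # the root directory
    return "-".join(p.replace("-", "\\x2d") for p in parts) + ".mount"


def render_mount_unit(device, where, subvol):
    """Return the contents of a mount unit that mounts `subvol` of the
    btrfs file system on `device` at `where`."""
    return "\n".join([
        "[Unit]",
        f"Description=btrfs subvolume {subvol}",
        "",
        "[Mount]",
        f"What={device}",
        f"Where={where}",
        "Type=btrfs",
        f"Options=subvol={subvol}",
        "",
        "[Install]",
        "WantedBy=local-fs.target",
        "",
    ])
```

For example, render_mount_unit("/dev/disk/by-label/ROOT", "/var/lib/machines", "@/var/lib/machines") would be written to a file named mount_unit_name("/var/lib/machines"), i.e. var-lib-machines.mount. Whether tmpfiles should generate anything like this is exactly the open question.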
> Also, tmpfiles won't actually create any subvols below /usr (unless a
> user dropped something in to do that on its own), it will only do so
> in the root dir for precisely /var, /tmp, /home and /srv. All others
> are created below /var. Which means your rule of "don't create subvols
> below system directories" isn't actually touched, because the
> read-only OS is monopolized in /usr anyway... Or maybe I am still not
> getting what you are trying to say?
The rule would be "don't create subvols below snapshots", and the
read-only OS is not exactly monopolized in /usr either (not only because
of /bin, /lib etc, but also because of /boot - see last paragraph of the
mail), but apart from that that nails it.
The issue was originally discovered when upgrading systemd on an older
openSUSE machine which did not have a unified /var subvolume, so
/var/lib/machines got attached to the root subvolume.
This may happen again in the future for us, but as said we are not the
only ones using this mechanism. Seeing the default Fedora and Ubuntu
btrfs layouts it's even more likely to happen if anybody is using
pattern 3 there. Apart from that I'd prefer systemd-tmpfiles to work
even if a user threw in something unexpected.
I'm wondering if just refusing to create a subvolume on a snapshot would
be another option... That way the problem would be given back to the
user or distribution.
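The "refuse" option could look roughly like the Python sketch below. This is not how systemd-tmpfiles works today; the Subvolume record and both functions are hypothetical, and it assumes that a snapshot can be recognized by having a parent UUID set while a plain subvolume has none.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class Subvolume:
    mount_point: str            # where the subvolume is visible in the system
    parent_uuid: Optional[str]  # assumption: set for snapshots, None otherwise


def containing_subvolume(directory, subvolumes):
    """Return the innermost known subvolume that contains `directory`."""
    target = PurePosixPath(directory)
    candidates = [
        s for s in subvolumes
        if PurePosixPath(s.mount_point) == target
        or PurePosixPath(s.mount_point) in target.parents
    ]
    return max(
        candidates,
        key=lambda s: len(PurePosixPath(s.mount_point).parts),
        default=None,
    )


def may_create_subvolume(path, subvolumes):
    """Refuse to create a new subvolume at `path` if it would end up
    nested inside a snapshot; allow it otherwise."""
    parent = containing_subvolume(PurePosixPath(path).parent, subvolumes)
    return parent is None or parent.parent_uuid is None
```

On a machine without a unified /var subvolume, where / is a snapshot, may_create_subvolume("/var/lib/machines", ...) returns False and the decision goes back to the user or distribution; once /var is a plain subvolume of its own, the same call returns True.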
>>> The assumption systemd-tmpfiles makes is always that the subvolumes
>>> it implicitly creates for you if they are missing are associated
>>> with the subvolume they are created below, and that this means they
>>> are snapshotted, removed and otherwise managed along with them.
>>
>> Keeping this logic more or less assumes that snapshots will always be used
>> as static backups and pattern 3 from above must not be used.
>
> I don't see that at all. I mean, this all depends how you want to
> associate /var with /. my assumption is that they belong together, but
> i figure that's not what you have in mind? you want to keep using the
> same /var even though you switch back and forth to different /?
Exactly - viewing them as separate entities after installation has
proven to work very reliably for us and is documented accordingly.
As said above, the reasoning behind this is that you usually don't want
to lose e.g. all accumulated database changes just because you have to
revert the system state due to a failed package update.
> i am not sure if i follow fully, but i think the model should be the
> other way round: keep the root file system in one subvolume, and keep
> /usr completely separate from that, and only combine the two through
> bind mounts when you want to go for one specific version. In that
> mode, all subvolumes systemd generates would be children of the root
> subvolume, as they should be, but /usr would be separate.
Currently the snapshot contains everything which is relevant for a
complete rollback of the system including /boot and /.snapshots
(containing snapper metadata). Splitting this up into three (or more)
separate subvolumes would be a major architectural change. I'll think
about this over the weekend, but I don't think I like the idea -
synchronizing those volumes will probably be a nightmare.
Ignaz
--
Ignaz Forster <iforster at suse.com>
Research Engineer
SUSE Linux GmbH, Maxfeldstr. 5, D-90409 Nürnberg
Tel: +49-911-74053-281; https://www.suse.com/
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard,
Graham Norton, HRB 21284 (AG Nürnberg)