[systemd-devel] journal fragmentation on Btrfs

Mon Apr 17 16:25:57 UTC 2017

On Mon, Apr 17, 2017 at 3:57 AM, Lennart Poettering
<lennart at poettering.net> wrote:

>> I do manual snapshots before software updates, which means new writes
>> to these files are subject to COW, but additional writes to the same
>> extents are overwrites and are not COW because of chattr +C. I've used
>> this same strategy for a long time, since systemd-journald defaults to
>> +C for journal files; but I've not seen them get this fragmented this
>> quickly.
>>
>
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

Correct.

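There are three states for files on Btrfs: cow (normal), nocow (+C),
and a snapshot of a nocow (+C) file, which is "cowandthennocow" or
whatever you want to call it. A snapshot of a nocow file does fragment
a ton at first, because the first write to each shared extent after
the snapshot is COW. After that, further writes to the new extent are
overwrites, so the file behaves as nocow again and won't fragment more.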

This explains one system's fragmented journals; but the other system
isn't snapshotting journals and I haven't figured out why they're so
fragmented. No snapshots, and they are all +C at create time
(systemd-journald default on Btrfs). Is it possible to prevent
journald from setting +C on /var/log/journal and
/var/log/journal/<machineid>? If I remove the attribute from those
directories, it gets set again at next boot, so any new journals
created there inherit it.
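
For anyone who wants to check which journals and directories really
carry +C without parsing lsattr output, here is a minimal Python sketch.
It assumes Linux and the FS_IOC_GETFLAGS ioctl number as encoded on
x86/ARM (other architectures may encode it differently). The helper
names are made up for illustration.

```python
import fcntl
import os
import struct

# FS_NOCOW_FL from linux/fs.h; shown as "C" by lsattr.
FS_NOCOW_FL = 0x00800000

# _IOR('f', 1, long) as encoded on x86 and ARM (assumption: other
# architectures may use different direction bits).
FS_IOC_GETFLAGS = (2 << 30) | (struct.calcsize("l") << 16) | (ord("f") << 8) | 1


def has_nocow(flags: int) -> bool:
    """Return True if the inode flag word has the NOCOW bit set."""
    return bool(flags & FS_NOCOW_FL)


def read_inode_flags(path: str) -> int:
    """Read the inode flags of a file or directory via FS_IOC_GETFLAGS."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
    finally:
        os.close(fd)
    # The kernel fills in an int-sized value at the start of the buffer.
    return struct.unpack("i", buf[: struct.calcsize("i")])[0]
```

Calling has_nocow(read_inode_flags(path)) on each file under
/var/log/journal/<machineid> shows whether it was created with +C.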

Anyway, snapshots of journals on Btrfs should be avoided for other
reasons. The autocleaning features (SystemMaxUse=, SystemKeepFree=, as
well as --vacuum-size=) don't work correctly when there are
snapshots of journals. Even when journald deletes journals, their
extents are pinned by snapshots, so they still take up the same space.
Basically journald could get into a situation where it deletes all
journals it sees, but no space is freed up because those journals are
stuck in a snapshot.
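
To illustrate the pinning, here is a toy Python model of
reference-counted extents. It is only a sketch of the accounting, not
how Btrfs is implemented; the class and method names are invented for
the example.

```python
from collections import Counter


class ExtentPool:
    """Toy model: extents are freed only when their last reference goes."""

    def __init__(self):
        self.refs = Counter()  # extent id -> number of references
        self.size = {}         # extent id -> size in bytes

    def add_file(self, extents):
        """Register a new file given as {extent_id: size_in_bytes}."""
        for ext_id, nbytes in extents.items():
            self.size[ext_id] = nbytes
            self.refs[ext_id] += 1

    def snapshot(self, extents):
        """A snapshot takes one more reference on every extent."""
        for ext_id in extents:
            self.refs[ext_id] += 1

    def delete(self, extents):
        """Drop one reference per extent; return the bytes actually freed."""
        freed = 0
        for ext_id in extents:
            self.refs[ext_id] -= 1
            if self.refs[ext_id] == 0:
                freed += self.size.pop(ext_id)
                del self.refs[ext_id]
        return freed


pool = ExtentPool()
journal = {"e1": 8 * 1024 * 1024, "e2": 8 * 1024 * 1024}
pool.add_file(journal)
pool.snapshot(journal)
freed_by_journald = pool.delete(journal)      # journald vacuums the file
freed_by_snapshot_rm = pool.delete(journal)   # snapshot is deleted later
```

In this model the first delete frees nothing, which is the situation
journald ends up in; the space only comes back once the snapshot's
reference is dropped too.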

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Definitely.

An easy solution would be for journald to create
/var/log/journal/<machineid> as a subvolume instead of a directory.
This will make journals immune to snapshots of the containing
subvolume (typically root fs). Of course systemd already makes
subvolumes behind the scenes for other sane reasons like
/var/lib/machines.
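
Done by hand, that amounts to a single "btrfs subvolume create" on the
journal directory path before any journals exist there. A small Python
sketch follows; the function names are hypothetical, and this is not
something journald does today.

```python
import subprocess


def journal_subvolume_cmd(machine_id, journal_root="/var/log/journal"):
    """Build the btrfs-progs command that creates the journal subvolume."""
    return ["btrfs", "subvolume", "create", f"{journal_root}/{machine_id}"]


def create_journal_subvolume(machine_id, journal_root="/var/log/journal"):
    """Run the command; needs root, Btrfs, and a target path that does not exist yet."""
    subprocess.run(journal_subvolume_cmd(machine_id, journal_root), check=True)
```

Since Btrfs snapshots are not recursive, a snapshot of the root
subvolume would then contain only an empty directory where the journal
subvolume is.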

Snapshotting logs strikes me as an invalid use case anyway. Anyone
would want logs immune to rollback, since rolling them back defeats
troubleshooting and auditing. Logs should be linear and continuous,
not rolled back. The snapshotting is arguably a mistake, due to lack
of user understanding of the consequences. It is admittedly esoteric.

> We also ask btrfs to defrag the file as soon as we mark it as
> archived... I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented. But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The biggest issue with the journal files is that they take up a lot of
space and defragment very inconsistently. Depending on kernel version,
they can become significantly larger.

Speaking of which, even with Compress=Yes (default), the journal files
are highly compressible. By copying some to a Btrfs volume mounted with
the compress option (which does not force compression; it gives up
easily on already compressed data), I'm finding 4-6x smaller files.
This is the last line for a couple of journals, from
btrfs-progs/btrfs-debugfs:

file: system@01b44589014542e3b48df31f152c0916-000000000000ca2b-00054546539416e8.journal
extents 384 disk size 9691136 logical size 50331648 ratio 5.19
file: system@2547b430ecdc441c9cf82569eeb22065-0000000000000001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68
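
The ratio is just logical size divided by disk size. A short Python
sketch that parses those summary lines and recomputes it (the regex
assumes exactly the line format shown above; parse_summary is a made-up
helper name):

```python
import re

SUMMARY = re.compile(
    r"extents (\d+) disk size (\d+) logical size (\d+) ratio ([\d.]+)"
)


def parse_summary(line):
    """Parse one btrfs-debugfs summary line into a dict of numbers."""
    m = SUMMARY.search(line)
    if m is None:
        raise ValueError(f"not a summary line: {line!r}")
    extents, disk, logical = (int(m.group(i)) for i in (1, 2, 3))
    return {
        "extents": extents,
        "disk": disk,
        "logical": logical,
        "ratio": logical / disk,            # compression ratio
        "avg_extent": logical // extents,   # average logical bytes per extent
    }


first = parse_summary("extents 384 disk size 9691136 logical size 50331648 ratio 5.19")
second = parse_summary("extents 768 disk size 21504000 logical size 100663296 ratio 4.68")
```

Both files work out to an average of 131072 logical bytes (128 KiB) per
extent, which matches the maximum size of a compressed extent on Btrfs,
so the extent counts here mostly reflect compression rather than
fragmentation.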

If there is a way to optimize this compression when rotating logs
(read, compress, write), then defragmentation isn't needed on Btrfs,
and all file systems gain the benefit of much smaller logs.
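
As a rough illustration of the read-compress-write idea, here is a
Python sketch that compresses a whole buffer in one pass and reports the
ratio. zlib is only a stand-in, the sample data is synthetic, and none
of this reflects how journald handles compression internally.

```python
import zlib


def recompress(data: bytes, level: int = 6):
    """Compress a whole buffer at once; return (compressed bytes, ratio)."""
    packed = zlib.compress(data, level)
    return packed, len(data) / len(packed)


# Synthetic, repetitive log-like records standing in for an archived journal.
sample = b"".join(
    f"_SYSTEMD_UNIT=example.service MESSAGE=request {i} done\n".encode()
    for i in range(10000)
)
packed, ratio = recompress(sample)
```

The point is only that compressing the file as a whole at rotation time
can exploit redundancy across records; the ratio on real journals would
have to be measured.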

-- 
Chris Murphy