[systemd-devel] consider dropping defrag of journals on btrfs

Fri Feb 5 15:02:57 UTC 2021

Chris Murphy writes:

> But it gets worse. The way systemd-journald is submitting the journals
> for defragmentation is making them more fragmented than just leaving
> them alone.

Wait, doesn't it just create a new file, fallocate the whole thing, copy
the contents, and delete the original?  How can that possibly make
fragmentation *worse*?

> All of those archived files have more fragments (post defrag) than
> they had when they were active. And here is the FIEMAP for the 96MB
> file which has 92 fragments.

How the heck did you end up with nearly 1 frag per mb?

> If you want an optimization that's actually useful on Btrfs,
> /var/log/journal/ could be a nested subvolume. That would prevent any
> snapshots above from turning the nodatacow journals into datacow
> journals, which does significantly increase fragmentation (it would in
> the exact same case if it were a reflink copy on XFS for that matter).

Wouldn't that mean that when you take snapshots, they don't include the
logs?  That seems like an anti feature that violates the principal of
least surprise.  If I make a snapshot of my root, I *expect* it to
contain my logs.

> I don't get the iops thing at all. What we care about in this case is
> latency. A least noticeable latency of around 150ms seems reasonable
> as a starting point, that's where users realize a delay between a key
> press and a character appearing. However, if I check for 10ms latency
> (using bcc-tools fileslower) when reading all of the above journals at
> once:
>
> $ sudo journalctl -D
> /mnt/varlog33/journal/b51b4a725db84fd286dcf4a790a50a1d/ --no-pager
>
> Not a single report. None. Nothing took even 10ms. And those journals
> are more fragmented than your 20 in a 100MB file.
>
> I don't have any hard drives to test this on. This is what, 10% of the
> market at this point? The best you can do there is the same as on SSD.

The above sounded like great data, but not if it was done on SSD.  Of
course it doesn't cause latency on an SSD.  I don't know about market
trends, but I stopped trusting my data to SSDs a few years ago when my
ext4 fs kept being corrupted and it appeared that the FTL of the drive
was randomly swapping the contents of different sectors around when I
found things like the contents of a text file in a block of the inode
table or a directory.

> You can't depend on sysfs to conditionally do defragmentation on only
> rotational media, too many fragile media claim to be rotating.

It sounds like you are arguing that it is better to do the wrong thing
on all SSDs rather than do the right thing on ones that aren't broken.

> Looking at the two original commits, I think they were always in
> conflict with each other, happening within months of each other. They
> are independent ways of dealing with the same problem, where only one
> of them is needed. And the best of the two is fallocate+nodatacow
> which makes the journals behave the same as on ext4 where you also
> don't do defragmentation.

This makes sense.