[systemd-devel] consider dropping defrag of journals on btrfs

Chris Murphy lists at colorremedies.com
Sat Feb 6 00:44:03 UTC 2021


On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering
<lennart at poettering.net> wrote:
>
> On Fr, 05.02.21 20:58, Maksim Fomin (maxim at fomin.one) wrote:
>
> > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > it isn't a NOP in that case?
> >
> > So, what is the reason for defragmenting the journal if BTRFS is
> > detected? This does not happen on other filesystems. I have read
> > this thread but have not found a clear answer to this question.
>
> btrfs, like any file system, fragments nocow files a bit. Without
> nocow (i.e. with cow) it fragments files horribly, given our write
> pattern (which is: append something to the end, and update a few
> pointers in the beginning). By upstream default we set nocow; some
> downstreams/users undo that however. (this is done via tmpfiles,
> i.e. journald doesn't actually set nocow ever).

I don't see why it's upstream's problem to solve downstream decisions.
If they want to (re)enable datacow, then they can also set up some kind
of service to defragment /var/log/journal/ on a schedule, or they can
use autodefrag.
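
For illustration, a downstream that re-enables datacow could ship
something like the following pair of units (the unit names, schedule
and btrfs binary path are made up for this example, not something
journald provides):

# /etc/systemd/system/journal-defrag.service
[Unit]
Description=Defragment journal files on btrfs

[Service]
Type=oneshot
ExecStart=/usr/bin/btrfs filesystem defragment -r /var/log/journal

# /etc/systemd/system/journal-defrag.timer
[Unit]
Description=Defragment journal files weekly

[Timer]
OnCalendar=weekly
Persistent=true

[Install]
WantedBy=timers.target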


> When we archive a journal file (i.e. stop writing to it) we know it
> will never receive any further writes. It's a good time to undo the
> fragmentation (we make no distinction here between heavily
> fragmented, lightly fragmented and not at all fragmented files) and
> thus make future access behaviour better, given that we'll still
> access the file regularly (because archiving in journald doesn't
> mean we stop reading it, it just means we stop writing it —
> journalctl always operates on the full data set). defragmentation
> happens in the bg once triggered, it's a simple ioctl you can invoke
> on a file. if the file is not fragmented it shouldn't do anything.

ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0

What 'len' value does journald use?
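
For anyone who wants to poke at this call directly, here's a minimal
standalone reproduction (the len and extent_thresh values are copied
from the strace above, not taken from journald's sources):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    /* the defrag ioctl needs the file open for writing */
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct btrfs_ioctl_defrag_range_args args;
    memset(&args, 0, sizeof(args));
    args.start = 0;
    args.len = 16777216;            /* 16 MiB, as in the trace above */
    args.extent_thresh = 33554432;  /* extents this large are left alone */
    args.compress_type = 0;         /* BTRFS_COMPRESS_NONE */

    /* supposedly a NOP on an unfragmented file, but extents smaller
     * than extent_thresh may still get rewritten, i.e. cost iops */
    if (ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args) < 0)
        perror("ioctl(BTRFS_IOC_DEFRAG_RANGE)");

    close(fd);
    return 0;
}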

> other file systems simply have no such ioctl, and they never fragment
> as terribly as btrfs can fragment. hence we don't call that ioctl.

I did explain how to avoid the fragmentation in the first place, to
obviate the need to defragment.

1. nodatacow. journald does this already.
2. fallocate the intended final journal file size from the start,
instead of growing the file in 8MB increments (see the sketch after
this list).
3. Don't reflink copy (including snapshot) the journals. This arguably
is not journald's responsibility, but since it creates both the
journal/ directory and the $MACHINEID directory, it could create one
or both of them as subvolumes instead, to ensure they're not subject
to snapshotting from above.
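
A minimal sketch of points 1 and 2 combined (the file name and the
128MB final size are hypothetical, chosen just for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

int main(void) {
    int fd = open("system.journal", O_RDWR | O_CREAT | O_EXCL, 0640);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* NOCOW only takes effect if set while the file is still empty */
    int attr = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0) {
        attr |= FS_NOCOW_FL;
        if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0)
            perror("FS_IOC_SETFLAGS");
    }

    /* preallocate the final size in one request instead of many 8MB
     * appends, so the allocator can hand back far fewer extents */
    if (fallocate(fd, 0, 0, 128ULL * 1024 * 1024) < 0)
        perror("fallocate");

    close(fd);
    return 0;
}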


> I'd even be fine dropping it
> entirely, if someone can actually show that the benefits of having
> the files unfragmented when archived don't outweigh the downside of
> generating some iops when executing the defragmentation.

I showed that the archived journals have far more fragmentation than
active journals. And the fragments in active journals are
insignificant, and can even be reduced by fully allocating the journal
file to its final size rather than appending, which has a good chance
of fragmenting the file on any file system, not just Btrfs.

Further, even *despite* this worse fragmentation of the archived
journals, bcc-tools fileslower shows no meaningful latency as a
result. I wrote this in the previous email. I don't understand what
you want me to show you.
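
For anyone who wants to reproduce that kind of check, something along
these lines works (the 1 ms threshold and the tool path are examples;
the path varies by distro):

# terminal 1: trace synchronous reads/writes slower than 1 ms
/usr/share/bcc/tools/fileslower 1

# terminal 2: make journalctl walk the full data set
journalctl --no-pager > /dev/null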

And since journald offers no way to disable the defragmentation on
Btrfs, I can't really do a longer-term A/B comparison, can I?


> i.e. someone
> does some profiling, on both ssd and rotating media. Apparently no
> one who cares about this wants to do such research though, and hence
> I remain deeply unimpressed. Let's not try to do such optimizations
> without any data that actually shows it betters things.

I did provide data. That you don't like what the data shows (archived
journals have more fragments than active journals) is not my fault.
The existing "optimization" is making things worse, in addition to
adding a pile of unnecessary writes upon journal rotation.

Conversely, you have not provided data proving that nodatacow
fallocated files on Btrfs are any more fragmented than fallocated
files on ext4 or XFS.

2-17 fragments on ext4:
https://pastebin.com/jiPhrDzG
https://pastebin.com/UggEiH2J
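
Fragment counts like those are easy to verify with filefrag from
e2fsprogs, which uses FIEMAP and so works on Btrfs as well (the glob
is an example, adjust it to your layout):

sudo filefrag /var/log/journal/*/*.journal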

That behavior is no different for nodatacow fallocated journals on
Btrfs. There's no point in defragmenting these no matter the file
system. I don't have to profile this on HDD; I know that even in the
best case you're not likely (and certainly not guaranteed) to get
fewer fragments than this. Defrag on Btrfs is for the
thousands-of-fragments case, which is what you get with datacow
journals.



-- 
Chris Murphy

