[systemd-devel] consider dropping defrag of journals on btrfs

Lennart Poettering lennart at poettering.net
Mon Feb 8 15:38:18 UTC 2021


On Fr, 05.02.21 17:44, Chris Murphy (lists at colorremedies.com) wrote:

> On Fri, Feb 5, 2021 at 3:55 PM Lennart Poettering
> <lennart at poettering.net> wrote:
> >
> > On Fr, 05.02.21 20:58, Maksim Fomin (maxim at fomin.one) wrote:
> >
> > > > You know, we issue the btrfs ioctl, under the assumption that if the
> > > > file is already perfectly defragmented it's a NOP. Are you suggesting
> > > > it isn't a NOP in that case?
> > >
> > > So, what is the reason for defragmenting the journal if BTRFS is
> > > detected? This does not happen on other filesystems. I have read
> > > this thread but have not found a clear answer to this question.
> >
> > btrfs, like any file system, fragments files a bit even with nocow.
> > Without nocow (i.e. with cow) it fragments files horribly, given our
> > write pattern (which is: append something to the end, and update a few
> > pointers in the beginning). By upstream default we set nocow; some
> > downstreams/users undo that however. (This is done via tmpfiles,
> > i.e. journald doesn't actually set nocow itself.)
>
> I don't see why it's upstream's problem to solve downstream decisions.
> If they want to (re)enable datacow, then they can also setup some kind
> of service to defragment /var/log/journal/ on a schedule, or they can
> use autodefrag.

There are good reasons to enable cow, even if we default to
nocow: RAID, checksumming, compression, all that. It's not as if
nocow is perfect and cow is terrible, or vice versa; in reality it's
a very blurry line, and hence we should support both modes, even if we
pick a default we think is on average the better choice. But because
we support both modes, and because defragmentation of an unfragmented
file should be a NOP, we issue the defrag ioctl either way.
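
The nocow default mentioned in the quoted text is applied via a
tmpfiles.d drop-in rather than by journald itself. A sketch of what such
a drop-in looks like (the path and exact spelling here are illustrative,
not a verbatim copy of the upstream file; `h` applies file attributes,
and `+C` is the NOCOW attribute):

```
# /usr/lib/tmpfiles.d/journal-nocow.conf (illustrative path)
# Type  Path              Mode  User  Group  Age  Argument
h       /var/log/journal  -     -     -      -    +C
```

Downstreams that want cow back simply mask or override this drop-in,
which is exactly the case under discussion.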

Moreover, if we didn't issue the defrag ioctl, there would be no way to
get the file defragmented at all.

I mean, to turn this into something constructive: please send a patch
that adds an env var $SYSTEMD_JOURNAL_BTRFS_DEFRAG which, when set to 0,
turns off the defrag. If you want to disable this locally, then I am
happy to merge a patch that makes it configurable.
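
The proposed variable doesn't exist yet; a minimal sketch of how such a
gate could look, assuming the hypothetical $SYSTEMD_JOURNAL_BTRFS_DEFRAG
semantics described above (unset or any value other than "0" keeps the
current defrag behaviour):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper for the proposed $SYSTEMD_JOURNAL_BTRFS_DEFRAG
 * variable: defrag stays enabled unless the variable is set to "0". */
bool journal_defrag_enabled(void) {
        const char *e = getenv("SYSTEMD_JOURNAL_BTRFS_DEFRAG");

        if (!e)
                return true;              /* unset: current behaviour */

        return strcmp(e, "0") != 0;       /* "0" disables the defrag */
}
```

journald would consult this once before issuing the defrag ioctl on an
archived file; the helper name and exact parsing are assumptions for
illustration.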

> > When we archive a journal file (i.e. stop writing to it) we know it
> > will never receive any further writes. It's a good time to undo the
> > fragmentation (we make no distinction whether it's heavily fragmented,
> > a little fragmented or not fragmented at all) and thus make future
> > access behaviour better, given that we'll still access the
> > file regularly (because archiving in journald doesn't mean we stop
> > reading it, it just means we stop writing it — journalctl always
> > operates on the full data set). Defragmentation happens in the
> > background once triggered; it's a simple ioctl you can invoke on a
> > file. If the file is not fragmented it shouldn't do anything.
>
> ioctl(3, BTRFS_IOC_DEFRAG_RANGE, {start=0, len=16777216, flags=0,
> extent_thresh=33554432, compress_type=BTRFS_COMPRESS_NONE}) = 0
>
> What 'len' value does journald use?

We don't call BTRFS_IOC_DEFRAG_RANGE.

Instead, we call BTRFS_IOC_DEFRAG with a NULL parameter.

> > other file systems simply have no such ioctl, and they never fragment
> > as terribly as btrfs can fragment. hence we don't call that ioctl.
>
> I did explain how to avoid the fragmentation in the first place, to
> obviate the need to defragment.
>
> 1. nodatacow. journald does this already
> 2. fallocate the intended final journal file size from the start,
> instead of growing them in 8MB increments.

Not an option, as mentioned. We maintain a bunch of journal files in
parallel, and if we allocated them 100% in advance, then we'd
have really shitty behaviour: we'd allocate a ton of space on
disk we don't actually use, but would nonetheless have taken it away
from everything else.

> 3. Don't reflink copy (including snapshot) the journals. This arguably
> is not journald's responsibility but as it creates both the journal/
> directory and $MACHINEID directory, it could make one or both of them
> as subvolumes instead to ensure they're not subject to snapshotting
> from above.

That's nonsense. People do recursive snapshots. nspawn does, machined
does, and so do others. Also, even if recursive snapshots didn't
exist, I am pretty sure people would be annoyed if we just messed with
their backup strategy and excluded some files.

> > I'd even be fine dropping it entirely, if someone actually can
> > show the benefits of having the files unfragmented when archived
> > don't outweigh the downside of generating some iops when executing
> > the defragmentation.
>
> I showed that the archived journals have way more fragmentation than
> active journals.

Can you report this to the btrfs maintainers?

Apparently defragmentation is broken on your btrfs then?

(I don't see that here btw)

> And the fragments in active journals are insignificant, and can even
> be reduced by fully allocating the journal file to final size rather
> than appending - which has a good chance of fragmenting the file on
> any file system, not just Btrfs.

Yeah, but it's not an option, see above, and other mails.

> Further, even *despite* this worse fragmentation of the archived
> journals, bcc-tools fileslower shows no meaningful latency as a
> result. I wrote this in the previous email. I don't understand what
> you want me to show you.

Get a rotating hdd and an ssd, do some performance measurements,
latency of journalctl invocations doing some filtering with a lot of
journal files. First fragged, then defragged.

> And since journald offers no ability to disable the defragment on
> Btrfs, I can't really do a longer term A/B comparison can I?

Patch it out temporarily? Or prep a patch that does the env var thing
I proposed above, and we can merge that.

> > optimizations without any data that actually shows it betters things.
>
> I did provide data. That you don't like what the data shows (archived
> journals have more fragments than active journals) is not my fault.
> The existing "optimization" is making things worse, in addition to
> adding a pile of unnecessary writes upon journal rotation.

I asked for latency during access, and iops during defragmentation. I
don't see that data anywhere.

You keep talking about "write amplification" (which I am not even sure
applies here; a defrag run isn't exactly a simple "write" event,
i.e. you are mixing up cause and effect), and that's something
one measures in iops (or in written bytes). That's the data I am asking for.

> Conversely, you have not provided data proving that nodatacow
> fallocated files on Btrfs are any more fragmented than fallocated
> files on ext4 or XFS.

Well, you want to change something, so it's on you to show that what you
propose works better in most cases and isn't too much worse in all
others.

I am not asking for crazy stuff, I am just asking for some profiling,
i.e. access times during journalctl invocations, and the iops generated
by defrag, compared with the iops our journal write pattern costs
without it.

> 2-17 fragments on ext4:
> https://pastebin.com/jiPhrDzG
> https://pastebin.com/UggEiH2J

We don't do defragmentation to have few fragments, but to optimize
access times. Hence, just showing the number of fragments is kinda
pointless. What matters are access times. And no, these are not the
same thing: 25 fragments ordered by offset on disk are great; 25
fragments in reverse order are terrible.
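
That ordering argument is directly checkable: the FS_IOC_FIEMAP ioctl
reports each extent's physical offset, so one can count a file's
fragments and see whether they are laid out in ascending order. A rough
sketch (the helper name is mine; this is a single capped query rather
than a full loop up to FIEMAP_EXTENT_LAST):

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 512U

/* Query up to MAX_EXTENTS extents of the file behind fd, store how many
 * were mapped in *n_extents, and set *in_order to 1 if their physical
 * offsets are ascending (the "good" layout), 0 otherwise. Returns 0 on
 * success, a negative errno on failure. */
int file_extents_in_order(int fd, unsigned *n_extents, int *in_order) {
        size_t sz = sizeof(struct fiemap) +
                    MAX_EXTENTS * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, sz);
        if (!fm)
                return -ENOMEM;

        fm->fm_length = FIEMAP_MAX_OFFSET;   /* whole file, from offset 0 */
        fm->fm_extent_count = MAX_EXTENTS;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                int r = -errno;
                free(fm);
                return r;
        }

        *n_extents = fm->fm_mapped_extents;
        *in_order = 1;
        for (unsigned i = 1; i < fm->fm_mapped_extents; i++)
                if (fm->fm_extents[i].fe_physical <
                    fm->fm_extents[i - 1].fe_physical)
                        *in_order = 0;

        free(fm);
        return 0;
}
```

Running something like this over fragged and defragged journal files,
alongside the latency numbers asked for above, would be the comparison
that actually settles the question.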

Lennart

--
Lennart Poettering, Berlin
