[systemd-devel] consider dropping defrag of journals on btrfs
Chris Murphy
lists at colorremedies.com
Thu Feb 4 05:51:14 UTC 2021
On Wed, Feb 3, 2021 at 9:41 AM Lennart Poettering
<lennart at poettering.net> wrote:
>
> On Di, 05.01.21 10:04, Chris Murphy (lists at colorremedies.com) wrote:
>
> > f27a386430cc7a27ebd06899d93310fb3bd4cee7
> > journald: whenever we rotate a file, btrfs defrag it
> >
> > Since systemd-journald sets nodatacow on /var/log/journal, the journals
> > don't really fragment much. I typically see 2-4 extents for the life
> > of the journal, depending on how many times it's grown, in what looks
> > like 8MiB increments. The defragment isn't really going to make any
> > improvement on that, at least not worth submitting it for additional
> > writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> > small amount of extra writes with no meaningful impact to wear, it
> > probably does have an impact on much more low end flash like USB
> > sticks, eMMC, and SD Cards. So I figure, let's just drop the
> > defragmentation step entirely.
>
> Quite frankly, given how iops-expensive btrfs is, one probably
> shouldn't choose btrfs for such small devices anyway. It's really not
> where btrfs shines, last time I looked.
Btrfs aggressively delays metadata and data allocation, so I don't
agree that it's iops-expensive. There is a wandering trees problem that
can result in write amplification, but that's a different problem. And
native compression has been shown to significantly reduce overall
writes.
But in any case, reading a journal file and rewriting it, which is
what defragmenting does, doesn't really have any benefit, given the
file doesn't fragment much anyway due to (a) nodatacow and (b)
fallocate, which is what systemd-journald does on Btrfs.
It'd make more sense to defragment only if the file is datacow. At
least then it also gets compressed, which isn't the case when it's
nodatacow.
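For reference, a minimal sketch of the nodatacow + fallocate pattern
I'm describing (an illustration, not journald's actual code; the
filename and the 8MiB size are just placeholders):

#define _GNU_SOURCE          /* for fallocate() */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
        /* NOCOW has to be set while the file is still empty,
         * otherwise btrfs ignores it. */
        int fd = open("test.journal", O_CREAT | O_EXCL | O_RDWR, 0640);
        if (fd < 0) { perror("open"); return 1; }

        unsigned int flags = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
                flags |= FS_NOCOW_FL;            /* same effect as chattr +C */
                if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
                        perror("FS_IOC_SETFLAGS");
        }

        /* Preallocate space up front; a file grown this way stays in a
         * handful of extents on btrfs. */
        if (fallocate(fd, 0, 0, 8 * 1024 * 1024) < 0)
                perror("fallocate");

        close(fd);
        return 0;
}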
>
> > Further, since they are nodatacow, they can't be submitted for
> > compression. There was a quasi-bug in Btrfs, now fixed, where
> > nodatacow files submitted for defragmentation were compressed. So we no
> > longer get that unintended benefit. This strengthens the case to just
> > drop the defragment step upon rotation, no other changes.
> >
> > What do you think?
>
> Did you actually check the iops this generates?
I don't understand the relevance.
>
> Not sure it's worth doing these kind of optimizations without any hard
> data how expensive this really is. It would be premature.
Submitting the journal for defragmentation in effect duplicates the
journal: all of its extents are read and the blocks rewritten to a new
location. That doubles the writes for that journal file. The defragment
isn't free.
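For context, the rotate-time defrag essentially comes down to one ioctl
on the journal fd, roughly like this sketch (my illustration, not
systemd's actual code):

#include <fcntl.h>
#include <linux/btrfs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv) {
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* A NULL argument asks the kernel to defragment the whole file
         * with default settings; it re-reads and rewrites the extents,
         * which is where the doubled writes come from. */
        if (ioctl(fd, BTRFS_IOC_DEFRAG, NULL) < 0)
                perror("BTRFS_IOC_DEFRAG");

        close(fd);
        return 0;
}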
> That said, if there's actual reason to optimize the iops here then we
> could make this smart: there's actually an API for querying
> fragmentation: we could defrag only if we notice the fragmentation is
> really too high.
FIEMAP isn't going to work in the case the files are being compressed.
The Btrfs extent size is capped at 128KiB in that case, and it looks
like massive fragmentation. So that needs to be made smarter first.
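To illustrate, counting extents via FIEMAP looks roughly like this (a
sketch, assuming the plain ioctl interface; filefrag does the same
thing). On a compressed file every extent is reported as at most
128KiB, so the count looks huge even when the on-disk layout is
perfectly reasonable:

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv) {
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct fiemap fm;
        memset(&fm, 0, sizeof(fm));
        fm.fm_start = 0;
        fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
        fm.fm_extent_count = 0;             /* just count, don't return extents */

        if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
                perror("FS_IOC_FIEMAP");
                close(fd);
                return 1;
        }

        printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
        close(fd);
        return 0;
}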
I don't have a problem submitting the journal for a one-time
defragment upon rotation if it's datacow, if an empty journal-nocow.conf
exists.
But by default, the combination of fallocate and nodatacow already
avoids all meaningful fragmentation, so long as the journals aren't
being snapshotted. If they are, well, that too is a different problem.
If the user does that and we're still defragmenting the files, it'll
explode their space consumption, because defragment is not snapshot
aware: it results in all shared extents becoming unshared.
> But quite frankly, this sounds polishing things after the horse
> already left the stable: if you want to optimize iops, then don't use
> btrfs. If you bought into btrfs, then apparently you are OK with the
> extra iops it generates, hence also the defrag costs.
Somehow I think you're missing what I'm asking for, which is to stop
the unnecessary defragment step, because it's not an optimization. It
doesn't meaningfully reduce fragmentation at all; it just adds write
amplification.
--
Chris Murphy