[systemd-devel] consider dropping defrag of journals on btrfs

Lennart Poettering lennart at poettering.net
Thu Feb 4 13:49:12 UTC 2021


On Wed, 03.02.21 22:51, Chris Murphy (lists at colorremedies.com) wrote:

> > > Since systemd-journald sets nodatacow on /var/log/journal the journals
> > > don't really fragment much. I typically see 2-4 extents for the life
> > > of the journal, depending on how many times it's grown, in what looks
> > > like 8MiB increments. The defragment isn't really going to make any
> > > improvement on that, at least not worth submitting it for additional
> > > writes on SSD. While laptop and desktop SSD/NVMe can handle such a
> > > small amount of extra writes with no meaningful impact to wear, it
> > > probably does have an impact on much more low end flash like USB
> > > sticks, eMMC, and SD Cards. So I figure, let's just drop the
> > > defragmentation step entirely.
> >
> > Quite frankly, given how iops-expensive btrfs is, one probably
> > shouldn't choose btrfs for such small devices anyway. It's really not
> > where btrfs shines, last time I looked.
>
> Btrfs aggressively delays metadata and data allocation, so I don't
> agree that it's expensive.

It's not a matter of agreeing or not. Last time people showed me
benchmarks (which admittedly was 2 or 3 years ago), the number of iops
for typical workloads was roughly twice that of ext4. Which I don't
really want to criticize; it's just the way it is. I mean, maybe they
have managed to lower the iops since then, but it's not a matter of
"agreeing", it's a matter of showing benchmarks that indicate this is
not a problem anymore.

> But in any case, reading a journal file and rewriting it out, which is
> what defragment does, doesn't really have any benefit given the file
> doesn't fragment much anyway due to (a) nodatacow and (b) fallocate,
> which is what systemd-journald does on Btrfs.
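
For reference, the nodatacow behaviour mentioned above is the NOCOW
file attribute, i.e. what "chattr +C" sets; it only takes effect if it
is applied while the file is still empty, or inherited from the
directory the file is created in. A minimal sketch of setting it via
the FS_IOC_SETFLAGS ioctl; this is not the actual journald code, and
the path is just an example:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Set the NOCOW attribute on a directory so that files created in it
 * inherit it; on btrfs their data extents are then not copy-on-write. */
static int set_nocow(const char *path) {
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    unsigned int flags;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        close(fd);
        return -1;
    }

    flags |= FS_NOCOW_FL;
    int r = ioctl(fd, FS_IOC_SETFLAGS, &flags);
    close(fd);
    return r;
}

int main(void) {
    if (set_nocow("/var/log/journal") < 0)  /* example path */
        perror("set_nocow");
    return 0;
}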

Well, at least on my system here there are still something like 20
fragments per file. That's not nothing.
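
For anyone who wants to count fragments on their own files: the extent
count can be read with the FIEMAP ioctl, which is roughly what
filefrag reports. A rough sketch, with the journal path merely an
example:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1]
                                : "/var/log/journal/system.journal"; /* example */

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* With fm_extent_count == 0 the kernel only reports how many
     * extents the file's current mapping needs. */
    struct fiemap probe;
    memset(&probe, 0, sizeof(probe));
    probe.fm_length = FIEMAP_MAX_OFFSET;

    if (ioctl(fd, FS_IOC_FIEMAP, &probe) < 0) {
        perror("FS_IOC_FIEMAP");
        close(fd);
        return 1;
    }

    printf("%s: %u extents\n", path, probe.fm_mapped_extents);
    close(fd);
    return 0;
}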

> > Did you actually check the iops this generates?
>
> I don't understand the relevance.

You want to optimize write patterns, I understand, i.e. minimize
iops. Hence start with profiling iops, i.e. what the defrag actually
costs, and then weigh that against the reduced access time when
accessing the files, in particular on rotating media.
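
One crude way to put numbers on the cost side would be to sample the
block device's write counters from /proc/diskstats before and after
the step in question, on an otherwise idle machine. A sketch only; the
device name is an example and the counters are device-wide, not
per-file:

#include <stdio.h>
#include <string.h>

struct devstat {
    unsigned long long wr_ios;      /* write requests completed */
    unsigned long long wr_sectors;  /* 512-byte sectors written */
};

static int read_devstat(const char *dev, struct devstat *out) {
    FILE *f = fopen("/proc/diskstats", "re");
    if (!f)
        return -1;

    char line[512], name[64];
    int found = -1;

    while (fgets(line, sizeof(line), f)) {
        /* Fields after the device name: reads, reads merged, sectors
         * read, ms reading, writes completed, writes merged, sectors
         * written, ... */
        if (sscanf(line, "%*s %*s %63s %*s %*s %*s %*s %llu %*s %llu",
                   name, &out->wr_ios, &out->wr_sectors) == 3 &&
            strcmp(name, dev) == 0) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(void) {
    struct devstat before, after;

    if (read_devstat("sda", &before) < 0)  /* example device */
        return 1;

    /* ... run the step being measured here, e.g. archive + defrag ... */

    if (read_devstat("sda", &after) < 0)
        return 1;

    printf("write requests: %llu, sectors written: %llu (x512 bytes)\n",
           after.wr_ios - before.wr_ios,
           after.wr_sectors - before.wr_sectors);
    return 0;
}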

> > Not sure it's worth doing these kind of optimizations without any hard
> > data how expensive this really is. It would be premature.
>
> Submitting the journal for defragment in effect duplicates the
> journal. Read all extents, and rewrite those blocks to a new location.
> It's doubling the writes for that journal file. It's not like the
> defragment is free.

No, but doing this once, in one big linear stream when the journal is
archived, might not be so bad if it then makes things much faster to
access for all future reads because the files aren't fragmented.
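
For completeness, the archive-time step being discussed boils down to
a single defrag request against the archived file; roughly something
like the following (a hedged sketch, not the actual systemd code, and
the file name is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1]
                                : "/var/log/journal/system@archived.journal"; /* example */

    int fd = open(path, O_RDWR | O_CLOEXEC);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Asks btrfs to rewrite the file's (sufficiently fragmented)
     * extents contiguously; those rewrites are the extra writes being
     * weighed against faster sequential access later. */
    if (ioctl(fd, BTRFS_IOC_DEFRAG, NULL) < 0)
        perror("BTRFS_IOC_DEFRAG");

    close(fd);
    return 0;
}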

> Somehow I think you're missing what I've been asking for, which is to stop
> the unnecessary defragment step because it's not an optimization. It
> doesn't meaningfully reduce fragmentation at all, it just adds write
> amplification.

Somehow I think you are missing what I am asking for: some data that
actually shows your optimization is worth it, i.e. that leaving the
files fragmented doesn't hurt access to the journal badly, and that
the number of iops is substantially lowered at the same time.

The thing is that we tend to have few active files and many archived
files, and since we interleave stuff our access patterns are pretty
bad already, so we don't want to pay even more for extra bad access
patterns because the archived files are fragmented.

Lennart

--
Lennart Poettering, Berlin

