[systemd-devel] consider dropping defrag of journals on btrfs

Chris Murphy lists at colorremedies.com
Tue Feb 9 06:52:26 UTC 2021


On Mon, Feb 8, 2021 at 8:20 AM Phillip Susi <phill at thesusis.net> wrote:
>
>
> Chris Murphy writes:
>
> > I showed that the archived journals have way more fragmentation than
> > active journals. And the fragments in active journals are
> > insignificant, and can even be reduced by fully allocating the journal
>
> Then clearly this is a problem with btrfs: it absolutely should not be
> making the files more fragmented when asked to defrag them.

I've asked. We'll see.


> > file to final size rather than appending - which has a good chance of
> > fragmenting the file on any file system, not just Btrfs.
>
> And yet, you just said the active journal had minimal fragmentation.

Yes, the extents are consistently 8MB in the nodatacow case, old and
new file system alike. Same as ext4 and XFS.

> That seems to mean that the 8mb fallocates that journald does are working
> well.  Sure, you could probably get fewer fragments by fallocating the
> whole 128 mb at once, but there are tradeoffs to that that are not worth
> it.  One fragment per 8 mb isn't a big deal.  Ideally a filesystem will
> manage to do better than that ( didn't btrfs have a persistent
> reservation system for this purpose? ), but it certainly should not
> commonly do worse.

I don't think any of the file systems guarantee a contiguous block
range upon fallocate; they only guarantee that writes to fallocated
space will succeed, i.e. it's a space reservation. But yeah in
practice, 8MB is small enough that chances are you'll see one 8MB
extent.
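
To make the reservation point concrete, here is a minimal sketch of growing
a file in 8 MiB steps. It uses Python's os.posix_fallocate on a scratch file;
the helper names (grow_in_chunks, reserve_demo) are made up for illustration
and this is not journald's actual code. It only shows that the call reserves
space; nothing in it can tell you whether the extents ended up contiguous.

```python
import os
import tempfile

CHUNK = 8 * 1024 * 1024  # 8 MiB, the step size discussed in this thread


def grow_in_chunks(fd, total, chunk=CHUNK):
    """Reserve `total` bytes in `chunk`-sized steps; return the offsets used."""
    offsets = []
    for offset in range(0, total, chunk):
        length = min(chunk, total - offset)
        # Reserves space so that later writes into the range do not run
        # out of space. It does not promise a physically contiguous range.
        os.posix_fallocate(fd, offset, length)
        offsets.append(offset)
    return offsets


def reserve_demo(total=2 * CHUNK, chunk=CHUNK, directory=None):
    """Grow a scratch file and report (apparent size, bytes allocated)."""
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        grow_in_chunks(tmp.fileno(), total, chunk)
        st = os.fstat(tmp.fileno())
        return st.st_size, st.st_blocks * 512
```

Calling reserve_demo(directory="/some/btrfs/path") (a placeholder path) and
then inspecting a kept copy of the file with filefrag -v would be one way to
see how many extents the reservation actually produced on a given file
system.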

And I agree 8MB isn't a big deal. Does anyone complain about journal
fragmentation on ext4 or XFS? If not, then we come full circle to my
second email in the thread: don't defragment when nodatacow, only
defragment when datacow. Or use BTRFS_IOC_DEFRAG_RANGE and specify
8MB length. That does seem to be a consistent no-op on nodatacow
journals, which have 8MB extents.
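
A rough sketch of both options, under stated assumptions: it reads "8MB
length" as the extent_thresh field of struct btrfs_ioctl_defrag_range_args
(per the kernel header, extents bigger than the threshold are considered
already defragged), which is an interpretation and not something settled in
this thread. The ioctl numbers are hard-coded for 64-bit Linux, and
defrag_if_datacow is a hypothetical helper, not anything in journald.

```python
import fcntl
import struct

EIGHT_MIB = 8 * 1024 * 1024

# ioctl numbers and flag bits as defined in linux/fs.h and linux/btrfs.h
# for 64-bit Linux; hard-coded because Python's stdlib does not export them.
FS_IOC_GETFLAGS = 0x80086601         # _IOR('f', 1, long)
FS_NOCOW_FL = 0x00800000             # the 'C' attribute shown by lsattr
BTRFS_IOC_DEFRAG_RANGE = 0x40309410  # _IOW(0x94, 16, 48-byte struct)

# struct btrfs_ioctl_defrag_range_args:
#   __u64 start, len, flags; __u32 extent_thresh, compress_type, unused[4]
DEFRAG_ARGS = struct.Struct("=QQQII4I")


def is_nocow(fd):
    """True if the open file carries the No_COW attribute."""
    buf = bytearray(8)
    fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
    (flags,) = struct.unpack_from("=I", buf)  # kernel fills in a 32-bit int
    return bool(flags & FS_NOCOW_FL)


def defrag_args(extent_thresh=EIGHT_MIB):
    """Whole-file range (start=0, len=max); extent_thresh carries 8 MiB."""
    return DEFRAG_ARGS.pack(0, 2**64 - 1, 0, extent_thresh, 0, 0, 0, 0, 0)


def defrag_if_datacow(path):
    """Hypothetical policy helper: skip nodatacow files, defrag the rest."""
    with open(path, "r+b") as f:
        if is_nocow(f.fileno()):
            return False  # leave nodatacow journals alone
        fcntl.ioctl(f.fileno(), BTRFS_IOC_DEFRAG_RANGE, defrag_args())
        return True
```

Either half is enough on its own: the No_COW check skips the ioctl entirely,
while the threshold lets the kernel decide that 8MB extents need no work.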


> > Further, even *despite* this worse fragmentation of the archived
> > journals, bcc-tools fileslower shows no meaningful latency as a
> > result. I wrote this in the previous email. I don't understand what
> > you want me to show you.
>
> *Of course* it showed no meaningful latency because you did the test on
> an SSD, which has no meaningful latency penalty due to fragmentation.
> The question is how bad is it on HDD.

The reason I'm dismissive is that the nodatacow fragment case is
the same as ext4 and XFS; the datacow fragment case is both
spectacular and non-deterministic. The workload will determine where
these random 4KiB journal writes end up on an HDD. I've seen journals
with hundreds to thousands of extents. I'm not sure what we learn from
me doing a single isolated test on an HDD.

And also, only defragmenting on rotation strikes me as leaving
performance on the table, right? If there is concern about fragmented
archived journals, then isn't there concern about fragmented active
journals?

But it sounds to me like you want to learn what the performance is of
journals defragmented with BTRFS_IOC_DEFRAG specifically? I don't think
it's interesting because you're still better off leaving nodatacow
journals alone, and something still has to be done in the datacow
case. It's two extremes. What the performance is doesn't matter; it's
not going to tell you anything you can't already infer from the two
layouts.


> > And since journald offers no ability to disable the defragment on
> > Btrfs, I can't really do a longer term A/B comparison, can I?
>
> You proposed a patch to disable it.  Test before and after the patch.

Is there a test mode for journald to just dump a bunch of random stuff
into the journal to age it? I don't want to wait weeks to get a dozen
journal files.
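
Failing a built-in test mode, one possible workaround is to flood the journal
with random messages over journald's native protocol socket. This is only a
sketch: the socket path and the simple KEY=value datagram form come from the
documented native protocol, but the function names, the identifier string and
the message size are arbitrary choices for illustration.

```python
import os
import socket

JOURNAL_SOCKET = "/run/systemd/journal/socket"


def build_entry(message, priority=6, identifier="journal-aging-test"):
    """Encode one entry in the simple "KEY=value" form of the protocol.

    This form is only valid when no value contains a newline.
    """
    fields = {
        "MESSAGE": message,
        "PRIORITY": str(priority),
        "SYSLOG_IDENTIFIER": identifier,
    }
    return "".join(f"{k}={v}\n" for k, v in fields.items()).encode()


def random_message(nbytes=256):
    """Random payload, hex-encoded so it never contains a newline."""
    return os.urandom(nbytes).hex()


def flood(count, nbytes=256, path=JOURNAL_SOCKET):
    """Send `count` random entries to journald; return how many were sent."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        for _ in range(count):
            sock.send(build_entry(random_message(nbytes)))
    return count
```

journald's rate limiting would probably need to be relaxed for a run like
flood(1_000_000) to land in full, and journalctl --rotate can force rotation
between batches. Note that this does not reproduce the write pattern of a
real, slowly growing journal interleaved with other workloads, so the
resulting layout is only an approximation of an aged one.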


>
> > I did provide data. That you don't like what the data shows: archived
> > journals have more fragments than active journals, is not my fault.
> > The existing "optimization" is making things worse, in addition to
> > adding a pile of unnecessary writes upon journal rotation.
>
> If it is making things worse, that is definitely a bug in btrfs.  It
> might be nice to avoid the writes on SSD though since there is no
> benefit there.

Agreed.


-- 
Chris Murphy
