[systemd-devel] journal fragmentation on Btrfs

Kai Krakow hurikhan77 at gmail.com
Mon Apr 17 14:01:48 UTC 2017


Am Mon, 17 Apr 2017 11:57:21 +0200
schrieb Lennart Poettering <lennart at poettering.net>:

> On Sun, 16.04.17 14:30, Chris Murphy (lists at colorremedies.com) wrote:
> 
> > Hi,
> > 
> > This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64)
> > that's maybe a couple weeks old and was clean installed. Drive is
> > NVMe.
> > 
> > 
> > # filefrag *
> > system.journal: 9283 extents found
> > user-1000.journal: 3437 extents found
> > # lsattr
> > ----------------C-- ./system.journal
> > ----------------C-- ./user-1000.journal
> > 
> > I do manual snapshots before software updates, which means new
> > writes to these files are subject to COW, but additional writes to
> > the same extents are overwrites and are not COW because of chattr
> > +C. I've used this same strategy for a long time, since
> > systemd-journald defaults to +C for journal files; but I've not
> > seen them get this fragmented this quickly.
> >  
> 
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

To mark a file nocow, it has to exist with zero bytes and never have
been written to. The nocow attribute (chattr +C) is inherited from the
directory when a file is created inside it. So the best way to go is
setting +C on the journal directory; all future journal files will
then be nocow.
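For example (a sketch assuming the usual /var/log/journal layout,
where journald keeps persistent journals under the machine-id
subdirectory; needs root):

```shell
#!/bin/sh
# Set nocow on the persistent journal directory; files created inside
# it from now on inherit the attribute.
chattr +C "/var/log/journal/$(cat /etc/machine-id)"

# Rotate so journald starts fresh files -- they are created empty and
# therefore pick up nocow; already-written files keep their old state.
journalctl --rotate

# Verify: newly created journal files should show the 'C' attribute.
lsattr "/var/log/journal/$(cat /etc/machine-id)"
```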

You can still do snapshots; nocow doesn't prohibit that and doesn't
make the journals cow again. What happens is that btrfs simply
unshares extents as soon as you write to either copy. The newly
created extent itself behaves like nocow again. If the extents are big
enough, this shouldn't introduce any serious fragmentation, it just
wastes space: btrfs won't split extents upon unsharing them during a
write. It may, however, "replace" only part of the unshared extent,
thus making three new ones: two still sharing the old copy, one
holding the new data. But since journals are append-only, that should
be no problem. It's just that journal data is written so slowly that
writes almost never get combined into one single write, resulting in
many extents.

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Well, usually you shouldn't have to manage the cow flag at all: just
set it once on the newly created journal directory and everything is
fine. And if people don't want this, they can easily unset the flag on
the directory and rotate the journal.
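Opting out would then just be the reverse (again a sketch; needs root,
and only files created after the rotation are affected):

```shell
#!/bin/sh
# Clear the attribute on the directory so future files are plain cow...
chattr -C "/var/log/journal/$(cat /etc/machine-id)"

# ...and rotate so journald creates fresh files without nocow.
journalctl --rotate
```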

> We also ask btrfs to defrag the file as soon as we mark it as
> archived...

This makes sense. And I've learned that the journal on btrfs works
much better with many small files than with a few big ones. I've
currently set the journal file size limit to 8 MB for that reason,
which gives me very good performance.
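In journald.conf terms that looks roughly like this (the drop-in path
is just an example name):

```
# /etc/systemd/journald.conf.d/btrfs.conf  (example drop-in name)
[Journal]
SystemMaxFileSize=8M
```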

> I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented.

Since the append behavior of btrfs is so bad wrt journal files, it
should be enough to simply let btrfs defrag the previously written
journal block upon appending to the file: Lennart, I think you hint to
the OS that the file is going to grow by truncating it to 8 MB beyond
the current end of file before continuing to write. That would be a
good moment to let btrfs defrag the old 8 MB block (and just that, not
the complete file). If this works well, you could maybe skip
defragging the complete file upon rotation, which should improve disk
IO performance during rotation.

I think the default extent size hint for btrfs defrag has recently
been raised to 32 MB, so it would be enough to do the above step every
32 MB.
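Done by hand that would look like this (a sketch; the glob for
archived journal files is just an example, and -t sets the target
extent size hint):

```shell
#!/bin/sh
# Defragment archived journal files, coalescing extents up to a
# 32 MB target size.
btrfs filesystem defragment -t 32M /var/log/journal/*/system@*.journal
```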

> But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The high number of extents may not be an indicator of fragmentation
when btrfs compression is used: compressed data is organized in
logical 128k units which filefrag reports as separate extents, while
in reality they may be laid out contiguously on disk, so there is no
actual fragmentation. It would be interesting to see the block map of
this.
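One way to check (a sketch; the file needs to be readable by the
calling user):

```shell
#!/bin/sh
# With -v, filefrag prints one line per extent including its physical
# offset; consecutive physical offsets mean the 128k compressed
# extents are in fact laid out contiguously on disk.
filefrag -v /var/log/journal/*/system.journal | head -n 20
```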

-- 
Regards,
Kai

Replies to list-only preferred.



