[systemd-devel] journal fragmentation on Btrfs

Mon Apr 24 16:07:52 UTC 2017

On Sat, 22.04.17 15:29, Andrei Borzenkov (arvidjaar at gmail.com) wrote:

> 18.04.2017 07:27, Chris Murphy wrote:
> > On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> >> 18.04.2017 06:50, Chris Murphy wrote:
> > 
> >>>> What exactly does "changes" mean? The write() syscall?
> >>>
> >>> filefrag reported entries increase, it's using FIEMAP.
> >>>
> >>
> >> So far it sounds like btrfs allocates a new extent on every write to
> >> the journal file. Each journal record itself is indeed relatively small.
> > 
> > Which is why it would be better if there were no fsync, so that Btrfs
> > can accumulate these writes and flush them in its own commit (30s
> > default for Btrfs).
> > 
> 
> It is not related to fsync. I ran some tests. Journald does not appear
> to preallocate the file or mmap the whole file (at least as far as I can
> see from the source); when it appends a new record it basically does
> 
> fallocate (fd, end_of_file, new_size)
> mmap (fd, end_of_file, new_size)
> write to new size
> 
> This results in a large number of extents, as each fallocate() ends up in
> a new extent.
> 
> I can easily reproduce it with a small program that uses a similar
> pattern; actually, mmap is also a red herring. Just fallocating a file in
> small increments gives a file consisting of an excessively large number
> of extents. How exactly those extents get distributed across the device
> probably depends on overall filesystem activity.
> 
> This is different from simply writing to the end of the file, which still
> results in several extents, but significantly larger ones.
> 
> BTW you get the same pattern from direct IO. Writing a 100M file in 4K
> blocks using cached writes gives me 7 extents here, with sizes between
> 500K and 25M. Writing the same with direct IO results in 25600 extents
> (the same as growing the file in 4K steps with fallocate).
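
For anyone who wants to try the comparison described above, here is a
minimal sketch in Python. It is not Andrei's test program (that was not
posted); the function names and the 4K step are assumptions made for
illustration. It only creates the two files; no extent counts are claimed
here, since they depend on the filesystem and its current state.

```python
import os

STEP = 4096  # assumed step size, matching the 4K steps mentioned above


def grow_with_fallocate(path, total, step=STEP):
    """Grow a file to `total` bytes with one small fallocate per step."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for offset in range(0, total, step):
            # Each call extends the file by `step` bytes at its current end.
            os.posix_fallocate(fd, offset, step)
    finally:
        os.close(fd)


def grow_with_append(path, total, step=STEP):
    """Grow a file to `total` bytes with plain buffered appends."""
    with open(path, "wb") as f:
        for _ in range(0, total, step):
            f.write(b"x" * step)
```

Call both with the same total (a multiple of the step) on the filesystem
under test, run sync, and then compare the two files with filefrag.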

BTW, we are not really married to any particular fancy semantics of
fallocate(). We call it mostly so that our later writes to the file
blocks using mmap() will not result in SIGBUS. There was also the hope
that letting the fs know in advance that we are about to append the
specified amount of bytes to the end of the file through mmap() would
be a good thing, not a bad thing... Or in other words, we really don't
need fallocate() to actually go to disk and write anything. All we want
is to *reserve* some space for us...
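
To illustrate the pattern under discussion (reserve space at the end of
the file, then write into it through a mapping), here is a rough sketch in
Python. This is not journald's code; append_record is a hypothetical
helper, and rounding every reservation up to whole pages is a
simplification made here so that the mmap offset stays aligned.

```python
import mmap
import os

PAGE = mmap.ALLOCATIONGRANULARITY  # mmap offsets must be multiples of this


def append_record(fd, record):
    """Reserve space at end of file, map it, and copy `record` into it.

    Returns the file offset at which the record was stored.
    """
    end = os.fstat(fd).st_size
    if end % PAGE:
        raise ValueError("file size must be page-aligned for this sketch")
    # Round the reservation up to whole pages (simplification, see above).
    length = -(-len(record) // PAGE) * PAGE
    # Reserve the blocks first, so that the stores through the mapping
    # below land on space that is already allocated.
    os.posix_fallocate(fd, end, length)
    with mmap.mmap(fd, length, offset=end) as window:
        window[:len(record)] = record
        window.flush()
    return end
```

The file descriptor has to be opened read-write, and the file has to
start out empty or page-aligned for the offset check to pass.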

Maybe this is something to report to the btrfs folks? It appears to me
their implementation of fallocate() does more than it has to according
to the docs.

Lennart

-- 
Lennart Poettering, Red Hat