[systemd-devel] journal fragmentation on Btrfs

Andrei Borzenkov arvidjaar at gmail.com
Sat Apr 22 12:29:26 UTC 2017


On 18.04.2017 07:27, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>> On 18.04.2017 06:50, Chris Murphy wrote:
> 
>>>> What exactly "changes" mean? Write() syscall?
>>>
>>> filefrag reported entries increase, it's using FIEMAP.
>>>
>>
>> So far it sounds like btrfs allocates a new extent on every write to
>> the journal file. Each journal record itself is indeed relatively small.
> 
> Hence it would be better if there were no fsync, so that Btrfs could
> accumulate these writes and handle them in its own commit (30s default).
> 

It is not related to fsync. I ran some tests. Journald does not appear
to preallocate the file or mmap the whole file (at least as far as I can
see from the source); when it appends a new record it basically does

fallocate(fd, end_of_file, new_size)
mmap(fd, end_of_file, new_size)
write into the newly mapped region

This results in a large number of extents, as each small fallocate()
ends up allocating a new extent.

I can easily reproduce it with a small program that uses a similar
pattern; actually, mmap is a red herring too. Just fallocate'ing a file
in small increments produces a file consisting of an overly large number
of extents. How exactly those extents get distributed across the device
probably depends on overall filesystem activity.
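A minimal sketch of that reproducer (my own; `grow_in_steps` and the
sizes are illustrative, not taken from journald's source):

```python
import os


def grow_in_steps(path, step=4096, total=1024 * 1024):
    """Extend a file via many small fallocate() calls, mimicking the
    append pattern described above. On Btrfs each small preallocation
    tends to become its own extent (inspect the result with filefrag)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        size = 0
        while size < total:
            # Each call extends the allocated range by one small step.
            os.posix_fallocate(fd, size, step)
            size += step
    finally:
        os.close(fd)
    return os.path.getsize(path)
```

Run it on a Btrfs mount and then `filefrag` the file; in my tests the
extent count grows with the number of fallocate() calls, while a single
large preallocation gives far fewer extents.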

This is different from simply appending to the file with write(), which
still results in several extents, but significantly larger ones.

BTW, you get the same pattern with direct I/O. Writing a 100M file in 4K
blocks using cached writes gives me 7 extents here, with sizes between
500K and 25M. Writing the same with direct I/O results in 25600 extents
(the same as growing the file in 4K steps with fallocate).
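That comparison can be sketched roughly like this (again my own sketch;
`write_blocks` is a hypothetical name, and since O_DIRECT requires an
aligned buffer, one is taken from an anonymous mmap, which is
page-aligned):

```python
import mmap
import os


def write_blocks(path, block=4096, count=256, direct=False):
    """Write `count` blocks of `block` bytes sequentially. With
    direct=True the page cache is bypassed, so each block-sized write
    is submitted to the filesystem individually."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if direct:
        flags |= os.O_DIRECT  # Linux-only; needs an aligned buffer
    buf = mmap.mmap(-1, block)  # anonymous mmap is page-aligned
    buf.write(b"x" * block)
    fd = os.open(path, flags, 0o644)
    try:
        for _ in range(count):
            os.write(fd, buf)  # writes the whole aligned buffer
    finally:
        os.close(fd)
        buf.close()
    return os.path.getsize(path)
```

With the cached variant the kernel can merge the 4K writes before
delayed allocation, which is consistent with the handful of large
extents above; with direct=True each write is allocated on its own,
matching the one-extent-per-block result.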

