[systemd-devel] journal fragmentation on Btrfs

Chris Murphy lists at colorremedies.com
Tue Apr 18 04:27:41 UTC 2017


On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> On 18.04.2017 06:50, Chris Murphy wrote:

>>> What exactly "changes" mean? Write() syscall?
>>
>> filefrag reported entries increase, it's using FIEMAP.
>>
>
> So far it sounds like btrfs allocates new extent on every write to
> journal file. Each journal record itself is relatively small indeed.

Hence it would be better if there were no fsync, so that Btrfs could
accumulate these writes and flush them on its own commit cycle (30 s
by default).
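For experimenting with that cadence, the fsync interval is tunable in
journald.conf; a minimal sketch, with an illustrative value rather than a
recommendation (note journald may still sync immediately for high-priority
messages regardless of this setting):

```ini
# /etc/systemd/journald.conf.d/sync.conf (illustrative)
[Journal]
# How often journald fsync()s the active journal file. Raising this
# gives Btrfs more room to batch writes into its own ~30 s commits.
SyncIntervalSec=5m
```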

It is likely that the ssd allocation option on these SSDs is a factor
in the fragmentation, because it tries to allocate into a unique 2 MiB
region based on the expected erase-block size. There's a lot of
discussion going on right now on the Btrfs list about whether these
assumptions still hold, and whether in some cases we should be using
nossd on higher-end SSDs and NVMe drives.
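The allocator choice is just a mount option, so it's easy to test per
filesystem; a hedged fstab sketch (the UUID is hypothetical, and whether
nossd actually helps is exactly what's being debated on the list):

```
# /etc/fstab (UUID is hypothetical)
# "nossd" disables the SSD-specific 2 MiB cluster allocation heuristic
UUID=deadbeef-0000-0000-0000-000000000000  /  btrfs  defaults,nossd  0 0
```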

What's for sure, though, is that with any of these allocators, nocow is
not good for lower-end SSDs like SD cards; all it does is ask to
write to the same LBA over and over again for a journal, which just
increases write amplification unnecessarily. So I'm beginning to think
that on SSDs it would be better if journald set +c (compress) rather
than +C (nocow) on journals. But there's still some research to do.
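If I recall correctly, it's a tmpfiles.d snippet shipped by systemd
(journal-nocow.conf, since around v228) that applies +C; masking it with
an empty file of the same name in /etc/tmpfiles.d/ would be one way to
experiment with +c instead. The shipped snippet is roughly:

```ini
# /usr/lib/tmpfiles.d/journal-nocow.conf (paraphrased from memory)
# "h" sets file attributes; "+C" marks the per-machine journal
# directory NOCOW so newly created journal files inherit it.
h /var/log/journal/%m - - - - +C
```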

I definitely think /var/log/journal/<machineid> should be a subvolume,
so that its contents are excluded from snapshots; snapshotting does
make the fragmentation problem worse.

And I also think the defragmentation feature should be disabled, at
least on SSDs, or should include zlib compression. The write
amplification on SSD is worse than just leaving the file fragmented.



>
>> Also with stat I see the times (all three) change on the file. If I go
>> to GNOME Terminal and just sudo some command, that itself causes the
>> current system.journal file to get all three times modified. It
>> happens immediately, there's no delay. So if I'm doing something like
>> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
>> the journal, it's just constantly writing stuff to the journal. This
>> is without anything running journalctl -f or reading the journal.
>>
>>>
>>>> #Storage=auto
>>>> #Compress=yes
>>>> #Seal=yes
>>>> #SplitMode=uid
>>>> #SyncIntervalSec=5m
>>>
>>> This controls how often systemd calls fsync() on currently active
>>> journal file. Do you see fsync() every 3 seconds?
>>
>> I have no idea if it's fsync or what. How can I tell?
>>
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as file is memory mapped, but it
> definitely does not do any fsync() every so often.
>
> Is it possible that btrfs behavior you observe is specific to memory
> mapped files handling?

Maybe. But even after a reboot I see the same extent entries in the
file. Granted, a good many of these one-block entries have addresses
that follow one another, so they often form larger contiguous extents,
but they still have separate entries.


>
>> Also, I don't think these journal files are being compressed.
>>
>> Using the btrfs-progs/btrfs-debugfs script on a few user journal
>> files, I'm seeing massive compression ratios. Maybe I'll try
>> Compress=No and see if there's a change.
>>
>
> Only actual message payload above some threshold (I think 256 or 512
> bytes, not sure) is compressed; everything else is not. For average
> syslog-type messages payload is far too small. This is really only
> interesting when you store core dump or similar.

Interesting, I see. Thanks.

I'll try strace and see what's going on.
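The threshold point above is easy to see with a quick sketch. zlib here
is just a stand-in for the XZ/LZ4 journald actually uses, and the message
and sizes are illustrative, not taken from a real journal:

```python
import zlib

# Journald only compresses individual field payloads above a size
# threshold (reported upthread as roughly 256-512 bytes).
short_msg = b"sshd[1234]: Accepted publickey for user from 10.0.0.1"
large_blob = bytes(64 * 1024)  # stand-in for a core-dump-sized payload

short_out = zlib.compress(short_msg)
large_out = zlib.compress(large_blob)

# A short syslog-style line barely shrinks (it may even grow), while a
# large redundant payload compresses dramatically.
print(len(short_msg), "->", len(short_out))
print(len(large_blob), "->", len(large_out))
```

This matches the point that compression only becomes interesting for
core dumps and similarly large payloads.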


-- 
Chris Murphy


More information about the systemd-devel mailing list