[systemd-devel] Slow startup of systemd-journal on BTRFS
Duncan
1i5t5.duncan at cox.net
Sat Jun 14 22:43:06 PDT 2014
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
excerpted:
> On 06/14/2014 04:53 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
>> excerpted:
>>
>>> Thanks for pointing that out. However, I am performing my tests on a
>>> Fedora 20 with systemd-208, which seems to have this change.
>>>
>>> I am reaching the conclusion that fallocate is not the problem. The
>>> fallocate increases the file size by about 8 MB, which is enough for
>>> some logging, so it is not called very often.
Right.
>> But...
Exactly, _but_...
>> [A]n fallocate of 8 MiB will increase the file size
>> by 8 MiB and write that out. So far so good as at that point the 8 MiB
>> should be a single extent. But then, data gets written into 4 KiB
>> blocks of that 8 MiB one at a time, and because btrfs is COW, the new
>> data in the block must be written to a new location.
>>
>> Which effectively means that by the time the 8 MiB is filled, each 4
>> KiB block has been rewritten to a new location and is now an extent
>> unto itself. So now that 8 MiB is composed of 2048 new extents, each
>> one a single 4 KiB block in size.
>
> Several people pointed to fallocate as the problem, but I don't
> understand the reason.
> 1) 8 MB is quite a large value, so fallocate is called (at worst) once
> during the boot. Often never, because the logs are less than 8 MB.
> 2) It is true that with fallocate btrfs "rewrites" each 4 KB page
> almost twice. But the first time is one "big" write of 8 MB; the
> second write would happen in any case. What I mean is that without the
> fallocate, journald would make the same small writes anyway.
>
> To be honest, I struggle to see the gain of having fallocate on a COW
> filesystem... maybe I don't understand the fallocate() call very well.
The base problem isn't fallocate per se, tho it's the trigger in this
case. The base problem is that for COW-based filesystems, *ANY*
rewriting of existing file content results in fragmentation.
It just so happens that the only reason there's existing file content to
be rewritten (as opposed to simply appended to) in this case is the
fallocate. The rewrite of existing file content is the problem, but the
existing file content is only there in this case because of the
fallocate.
Taking a step back...
On a non-COW filesystem, allocating 8 MiB ahead and writing into it
rewrites into the already-allocated location, thus guaranteeing extents
of 8 MiB each, since once the space is allocated it's simply rewritten
in-place. Thus, on a non-COW filesystem, pre-allocating in chunks larger
than single filesystem blocks, when an app knows the data is eventually
going to be written to fill that space anyway, is a GOOD thing, which is
why systemd is doing it.
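The pattern being described can be sketched as follows (a minimal
illustration, not journald's actual code; only the 8 MiB step size is
taken from the discussion above, the file name and function are made up):

```python
import os

CHUNK = 8 * 1024 * 1024  # journald grows journal files in 8 MiB steps


def prealloc_and_write(path, data):
    """Preallocate CHUNK bytes, then write data at the start of the file.

    On a non-COW filesystem the write lands inside the preallocated
    extent; on btrfs the same write is COWed to a new location instead.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        os.posix_fallocate(fd, 0, CHUNK)  # file is now CHUNK bytes of zeroes
        os.pwrite(fd, data, 0)            # a rewrite of block 0, not an append
    finally:
        os.close(fd)
```

Note that after the posix_fallocate() the file already has its full
size, so every subsequent pwrite() into it is a rewrite of existing
content, which is exactly the case the rest of this mail is about.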
But on a COW-based filesystem fallocate is the exact opposite, a BAD
thing, because the fallocate forces the file to be written out at that
size, effectively filled with nulls/blanks. Then the actual logging
comes along and rewrites those nulls/blanks with real data, but now each
write is a rewrite, and on a COW (copy-on-write) based filesystem the
rewritten block is copied elsewhere; it does NOT overwrite the existing
null/blank block. "Elsewhere" by definition means detached from the
previous blocks, thus in an extent all by itself.
Once the full 2048 original blocks composing that 8 MiB are filled in
with actual data, because they were rewrites of the null/blank blocks
that fallocate had already forced to be allocated, that's now 2048
separate extents, 2048 separate file fragments. Without the forced
fallocate, the writes would all have been appends, and there would have
been at least /some/ chance of several of those 2048 blocks being
written close enough together in time to go out as a single extent. So
while the 8 MiB might still not have been a single extent, it might
have been perhaps 512 or 1024 extents, instead of the 2048 it ended up
being because fallocate made each block a rewrite into an existing
file, not an append-write at the end of it.
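The arithmetic can be made concrete with a toy model (purely
illustrative, it is not how btrfs actually allocates; the 4-block flush
size in the append case is an assumption standing in for "writes close
together in time get merged"). The model treats every write() call as
producing one on-disk extent for the blocks it covers:

```python
BLOCK = 4096
FILE_BLOCKS = (8 * 1024 * 1024) // BLOCK  # 2048 four-KiB blocks in 8 MiB


def cow_extents(writes, nblocks):
    """Toy COW model: every write becomes its own on-disk extent.

    `writes` is a list of (offset, length) in blocks; later writes
    replace earlier data (COW relocation). Returns the extent count of
    the final file: the number of runs of adjacent blocks that came
    from the same write.
    """
    owner = [None] * nblocks
    for wid, (off, length) in enumerate(writes):
        for b in range(off, off + length):
            owner[b] = wid
    return sum(1 for i in range(nblocks) if i == 0 or owner[i] != owner[i - 1])


# fallocate the whole 8 MiB, then fill it one 4 KiB block at a time:
falloc_then_fill = [(0, FILE_BLOCKS)] + [(i, 1) for i in range(FILE_BLOCKS)]
# no fallocate: append-only, in (hypothetical) 4-block flushes:
append_in_batches = [(i, 4) for i in range(0, FILE_BLOCKS, 4)]

print(cow_extents(falloc_then_fill, FILE_BLOCKS))   # 2048
print(cow_extents(append_in_batches, FILE_BLOCKS))  # 512
```

The fallocate case ends up with one extent per block because each block's
final content comes from its own single-block rewrite; the append case
gets one extent per flush, matching the "perhaps 512 or 1024 extents"
estimate above.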
> [...]
>> Another alternative is that distros will start setting /var/log/journal
>> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
>> the problem. (Altho if they do automated snapshotting they'll also
>> have to set it as its own subvolume, to avoid the
>> first-write-after-snapshot-
>> is-COW problem.) Well, that, and/or set autodefrag in the default
>> mount options.
>
> Pay attention: this also removes the checksums, which are very useful
> in a RAID configuration.
Well, they can be. But this is only log data, not executables or the
like, and (as Kai K points out) journald has its own checksumming method
in any case.
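For reference, a setup script would normally set NOCOW with chattr +C;
programmatically it is the FS_IOC_SETFLAGS ioctl. A sketch, with the
caveats that the ioctl numbers below are the usual Linux x86-64 values,
the flag only takes effect on btrfs (and only for empty files, which is
why you'd set it on the /var/log/journal directory so new journal files
inherit it), and the function name is mine:

```python
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601  # usual Linux x86-64 ioctl numbers
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000      # the chattr +C / NOCOW attribute


def set_nocow(path):
    """Try to set NOCOW on path (file or directory).

    Returns True on success, False if the filesystem does not support
    the flag (i.e. anything that is not btrfs).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
        flags, = struct.unpack("l", buf)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", flags | FS_NOCOW_FL))
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

On a directory the flag is inherited by files created afterwards, so
existing journal files would still need to be recreated to benefit.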
Besides which, you still haven't explained why you can't either set the
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them automatically at
hourly or daily or whatever intervals. Those don't disable btrfs
checksumming, but /should/ solve the problem.
(Tho if you're btrfs snapshotting the journals defrag has its own
implications, but they should be relatively limited in scope compared to
the fragmentation issues we're dealing with here.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman