[systemd-devel] Slow startup of systemd-journal on BTRFS
Duncan
1i5t5.duncan at cox.net
Sat Jun 14 22:43:06 PDT 2014
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
excerpted:
> On 06/14/2014 04:53 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
>> excerpted:
>>
>>> Thanks for pointing that out. However, I am performing my tests on a
>>> Fedora 20 with systemd-208, which seems to have this change.
>>>
>>> I am reaching the conclusion that fallocate is not the problem. The
>>> fallocate increases the file size by about 8 MB, which is enough for
>>> some logging, so it is not called very often.
Right.
>> But...
Exactly, _but_...
>> [A]n fallocate of 8 MiB will increase the file size
>> by 8 MiB and write that out. So far so good as at that point the 8 MiB
>> should be a single extent. But then, data gets written into 4 KiB
>> blocks of that 8 MiB one at a time, and because btrfs is COW, the new
>> data in the block must be written to a new location.
>>
>> Which effectively means that by the time the 8 MiB is filled, each 4
>> KiB block has been rewritten to a new location and is now an extent
>> unto itself. So now that 8 MiB is composed of 2048 new extents, each
>> one a single 4 KiB block in size.
>
> Several people pointed to fallocate as the problem, but I don't
> understand the reason.
> 1) 8 MB is quite a large value, so fallocate is called (at worst) once
> during the boot. Often never, because the logs are less than 8 MB.
> 2) It is true that with fallocate btrfs "rewrites" each 4 KB page
> almost twice. But the first time is one "big" write of 8 MB; the
> second write would happen in any case. What I mean is that without the
> fallocate, journald would make the same small writes anyway.
>
> To be honest, I struggle to see the gain of having fallocate on a COW
> filesystem... maybe I don't understand the fallocate() call very well.
The base problem isn't fallocate per se, tho it's the trigger in this
case. The base problem is that for COW-based filesystems, *ANY*
rewriting of existing file content results in fragmentation.
It just so happens that the only reason there's existing file content to
be rewritten (as opposed to simply appended to) in this case is the
fallocate. The rewrite of existing file content is the problem, but the
existing file content is only there in this case because of the
fallocate.
Taking a step back...
On a non-COW filesystem, allocating 8 MiB ahead and writing into it
rewrites into the already-allocated location, thus guaranteeing extents
of 8 MiB each, since once the space is allocated it's simply rewritten
in-place. Thus, on a non-COW filesystem, pre-allocating in chunks larger
than single filesystem blocks, when an app knows the data is eventually
going to be written to fill that space anyway, is a GOOD thing, which is
why systemd is doing it.
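The pattern being described can be sketched as follows (a minimal
illustration, not journald's actual code; only the 8 MiB step size is
taken from the discussion above, the file name and function are made up):

```python
import os

CHUNK = 8 * 1024 * 1024  # journald grows journal files in 8 MiB steps


def prealloc_and_write(path, data):
    """Preallocate CHUNK bytes, then write data at the start of the file.

    On a non-COW filesystem the write lands inside the preallocated
    extent; on btrfs the same write is COWed to a new location instead.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        os.posix_fallocate(fd, 0, CHUNK)  # file is now CHUNK bytes of zeroes
        os.pwrite(fd, data, 0)            # a rewrite of block 0, not an append
    finally:
        os.close(fd)
```

Note that after the posix_fallocate() the file already has its full
size, so every subsequent pwrite() into it is a rewrite of existing
content, which is exactly the case the rest of this mail is about.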
But on a COW-based filesystem fallocate is the exact opposite, a BAD
thing, because the fallocate forces the file to be written out at that
size, effectively filled with nulls/blanks. Then the actual logging
comes along and rewrites those nulls/blanks with real data, but now each
write is a rewrite, and on a COW (copy-on-write) based filesystem the
rewritten block is copied elsewhere; it does NOT overwrite the existing
null/blank block. "Elsewhere" by definition means detached from the
previous blocks, thus in an extent all by itself.
Once the full 2048 original blocks composing that 8 MiB are filled in
with actual data, because they were rewrites of the null/blank blocks
that fallocate had already forced to be allocated, that's now 2048
separate extents, 2048 separate file fragments. Without the forced
fallocate, the writes would all have been appends, and there would have
been at least /some/ chance of several of those 2048 blocks being
written close enough together in time to go out as a single extent. So
while the 8 MiB might still not have been a single extent, it might
have been perhaps 512 or 1024 extents, instead of the 2048 it ended up
being because fallocate made each block a rewrite into an existing
file, not an append-write at the end of it.
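The arithmetic can be made concrete with a toy model (purely
illustrative, it is not how btrfs actually allocates; the 4-block flush
size in the append case is an assumption standing in for "writes close
together in time get merged"). The model treats every write() call as
producing one on-disk extent for the blocks it covers:

```python
BLOCK = 4096
FILE_BLOCKS = (8 * 1024 * 1024) // BLOCK  # 2048 four-KiB blocks in 8 MiB


def cow_extents(writes, nblocks):
    """Toy COW model: every write becomes its own on-disk extent.

    `writes` is a list of (offset, length) in blocks; later writes
    replace earlier data (COW relocation). Returns the extent count of
    the final file: the number of runs of adjacent blocks that came
    from the same write.
    """
    owner = [None] * nblocks
    for wid, (off, length) in enumerate(writes):
        for b in range(off, off + length):
            owner[b] = wid
    return sum(1 for i in range(nblocks) if i == 0 or owner[i] != owner[i - 1])


# fallocate the whole 8 MiB, then fill it one 4 KiB block at a time:
falloc_then_fill = [(0, FILE_BLOCKS)] + [(i, 1) for i in range(FILE_BLOCKS)]
# no fallocate: append-only, in (hypothetical) 4-block flushes:
append_in_batches = [(i, 4) for i in range(0, FILE_BLOCKS, 4)]

print(cow_extents(falloc_then_fill, FILE_BLOCKS))   # 2048
print(cow_extents(append_in_batches, FILE_BLOCKS))  # 512
```

The fallocate case ends up with one extent per block because each block's
final content comes from its own single-block rewrite; the append case
gets one extent per flush, matching the "perhaps 512 or 1024 extents"
estimate above.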
> [...]
>> Another alternative is that distros will start setting /var/log/journal
>> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
>> the problem. (Altho if they do automated snapshotting they'll also
>> have to set it as its own subvolume, to avoid the
>> first-write-after-snapshot-
>> is-COW problem.) Well, that, and/or set autodefrag in the default
>> mount options.
>
> Pay attention: this also removes the checksums, which are very useful
> in a RAID configuration.
Well, they can be. But this is only log data, not executables or the
like, and (as Kai K points out) journald has its own checksumming method
in any case.
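For reference, a setup script would normally set NOCOW with chattr +C;
programmatically it is the FS_IOC_SETFLAGS ioctl. A sketch, with the
caveats that the ioctl numbers below are the usual Linux x86-64 values,
the flag only takes effect on btrfs (and only for empty files, which is
why you'd set it on the /var/log/journal directory so new journal files
inherit it), and the function name is mine:

```python
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601  # usual Linux x86-64 ioctl numbers
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000      # the chattr +C / NOCOW attribute


def set_nocow(path):
    """Try to set NOCOW on path (file or directory).

    Returns True on success, False if the filesystem does not support
    the flag (i.e. anything that is not btrfs).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
        flags, = struct.unpack("l", buf)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", flags | FS_NOCOW_FL))
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
```

On a directory the flag is inherited by files created afterwards, so
existing journal files would still need to be recreated to benefit.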
Besides which, you still haven't explained why you can't either set the
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them automatically at
hourly or daily or whatever intervals. Those don't disable btrfs
checksumming, but /should/ solve the problem.
(Tho if you're btrfs snapshotting the journals defrag has its own
implications, but they should be relatively limited in scope compared to
the fragmentation issues we're dealing with here.)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman