[systemd-devel] Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli
kreijack at libero.it
Sat Jun 14 00:52:39 PDT 2014
On 06/14/2014 04:53 AM, Duncan wrote:
> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
> excerpted:
>
>> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>>
>>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>>> actually pretty much equally bad without NOCOW set on the file.
>>>
>>> So maybe it's been fixed in systemd since the last time I looked.
>>> Yup:
>>>
>>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>>>
>>> The reason it was changed? To "save a syscall per append", not to
>>> prevent fragmentation of the file, which was the problem everyone was
>>> complaining about...
>>
>> Thanks for pointing that out. However, I am running my tests on Fedora
>> 20 with systemd-208, which seems to have this change.
>>>
>>>> Why? Because btrfs data blocks are 4 KiB. With COW, the effect for
>>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>>> its own extent.
>>
>> I am reaching the conclusion that fallocate is not the problem. The
>> fallocate increases the file size by about 8 MB, which is enough for some
>> logging, so it is not called very often.
>
> But...
>
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8
> MiB and write that out. So far so good as at that point the 8 MiB should
> be a single extent. But then, data gets written into 4 KiB blocks of
> that 8 MiB one at a time, and because btrfs is COW, the new data in the
> block must be written to a new location.
>
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB
> block has been rewritten to a new location and is now an extent unto
> itself. So now that 8 MiB is composed of 2048 new extents, each one a
> single 4 KiB block in size.
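To make the arithmetic in the paragraph quoted above concrete, here is a toy Python model. It is only a sketch under assumptions of mine that are not taken from btrfs: the allocator hands out strictly increasing block numbers, and some other writer (the hypothetical foreign_blocks parameter) takes space between two journal writes, so consecutive journal blocks never land next to each other. It does not reproduce the real btrfs allocator.

```python
BLOCK = 4 * 1024            # btrfs data block size mentioned in the thread
PREALLOC = 8 * 1024 * 1024  # fallocate step mentioned in the thread


def count_extents(physical):
    """Count runs of physically contiguous blocks in a logical->physical map."""
    if not physical:
        return 0
    extents = 1
    for prev, cur in zip(physical, physical[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(cow, foreign_blocks=1):
    """Preallocate one contiguous range, then rewrite it one block at a time.

    foreign_blocks is a hypothetical knob: how many blocks other writers
    consume between two journal writes (assumption of this toy model).
    """
    nblocks = PREALLOC // BLOCK
    physical = list(range(nblocks))  # after fallocate: a single extent
    next_free = nblocks              # first free block after the preallocation
    for logical in range(nblocks):
        if cow:
            next_free += foreign_blocks    # space taken by someone else
            physical[logical] = next_free  # COW: new data goes to a new place
            next_free += 1
        # without COW the block is overwritten in place, nothing moves
    return count_extents(physical)


if __name__ == "__main__":
    print("COW:  ", simulate(cow=True))
    print("NOCOW:", simulate(cow=False))
```

Within this model the COW case ends with 2048 extents (8 MiB / 4 KiB) and the in-place case keeps the single preallocated extent. With foreign_blocks=0 the rewritten blocks end up adjacent again, so the worst case in the model depends on the small writes being spread out in time.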
Several people have pointed to fallocate as the problem, but I don't understand why.
1) 8 MB is quite a large value, so fallocate is called at most once during boot, and often never, because the logs are smaller than 8 MB.
2) It is true that with fallocate btrfs ends up writing each 4 KiB page almost twice. But the first write is one "big" 8 MB write, and the second would happen in any case: even without the fallocate, journald would still make small writes.
To be honest, I struggle to see the gain of using fallocate on a COW filesystem... maybe I just don't understand the fallocate() call very well.
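For reference, what the call does can be checked from user space. The following Python sketch uses os.posix_fallocate() to grow a file in 8 MiB steps and reports the resulting size and allocated bytes. The file name demo.journal and the helper name are hypothetical, and this is not journald's code, only an illustration of the system call under discussion.

```python
import os
import tempfile

PREALLOC = 8 * 1024 * 1024  # 8 MiB, the step size discussed in this thread


def grow_with_fallocate(path, extra=PREALLOC):
    """Extend the file at path by extra bytes.

    Returns (file size, allocated bytes) after the call.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    try:
        old_size = os.fstat(fd).st_size
        # Ask the filesystem to allocate space for [old_size, old_size + extra)
        os.posix_fallocate(fd, old_size, extra)
        st = os.fstat(fd)
        # st_blocks is counted in 512-byte units on Linux
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, allocated = grow_with_fallocate(os.path.join(d, "demo.journal"))
        print("size:", size, "allocated:", allocated)
```

The file size grows by the requested amount before any log data is written. According to the explanation quoted above, on btrfs with COW the later small writes do not land in that preallocated range, which is why I question the benefit.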
>
> =:^(
>
> Tho as I already stated, for file sizes up to 128 MiB or so anyway[2], the
> btrfs autodefrag mount option should at least catch that and rewrite
> (again), this time sequentially.
>
>> I have to investigate further what happens when the logs are copied from
>> /run to /var/log/journal: this is when journald seems to slow everything down.
>
> That's an interesting point.
>
> At least in theory, during normal operation journald will write to
> /var/log/journal, but there's a point during boot at which it flushes the
> information accumulated during boot from the volatile /run location to
> the non-volatile /var/log location. /That/ write, at least, should be
> sequential, since there will be > 4 KiB of journal accumulated that needs
> to be transferred at once. However, if it's being handled by the forced
> pre-write fallocate described above, then that's not going to be the
> case, as it'll then be a rewrite of already fallocated file blocks and
> thus will get COWed exactly as I described above.
>
> =:^(
>
>
>> I have set up a PC that reboots continuously; I am collecting the time
>> required to finish the boot vs. the fragmentation of the system.journal
>> file vs. the number of boots. The results are dramatic: after 20 reboots,
>> the boot time increases by 20-30 seconds. Defragmenting
>> system.journal restores the original boot time, but after
>> another 20 reboots, the boot again takes 20-30 seconds longer....
>>
>> It is a slow PC, but I saw the same behavior on a more modern PC
>> (i5 with 8 GB).
>>
>> On both PCs the disk is a mechanical one...
>
> The problem's duplicable. That's the first step toward a fix. =:^)
I hope so.
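For anyone who wants to repeat the reboot test quoted above, the fragmentation of system.journal can be recorded per boot with a small Python helper. This is a sketch that assumes the filefrag tool from e2fsprogs is installed and prints its usual "N extents found" summary line. The function names and the sample line in the comment are made up for illustration, not taken from my measurements.

```python
import re
import subprocess

# filefrag summary line, e.g. "system.journal: 2048 extents found"
# (sample text is illustrative only)
_EXTENTS_RE = re.compile(r":\s*(\d+) extents? found")


def parse_filefrag(output):
    """Return the extent count found in filefrag's summary output."""
    match = _EXTENTS_RE.search(output)
    if match is None:
        raise ValueError("unrecognised filefrag output: %r" % output)
    return int(match.group(1))


def journal_extents(path):
    """Run filefrag on path and return the number of extents it reports."""
    result = subprocess.run(
        ["filefrag", path], check=True, capture_output=True, text=True
    )
    return parse_filefrag(result.stdout)
```

Calling journal_extents() on the journal file after each boot and logging the value next to the boot time gives the two series compared in the test.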
>
>>> And that's now a btrfs problem.... :/
>>
>> Are you sure ?
>
> As they say, "Whoosh!"
>
[...]
> Another alternative is that distros will start setting /var/log/journal
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding
> the problem. (Altho if they do automated snapshotting they'll also have
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.) Well, that, and/or set autodefrag in the default mount
> options.
Note that this also disables checksums, which are very useful in a RAID configuration.
>
> Meanwhile, there's some focus on making btrfs behave better with such
> rewrite-pattern files, but while I think the problem can be made /some/
> better, hopefully enough that the defaults bother far fewer people in far
> fewer cases, I expect it'll always be a bit of a sore spot because that's
> just how the technology works, and as such, setting NOCOW for such files
> and/or using autodefrag will continue to be recommended for an optimized
> setup.
>
> ---
> [1] "Properly" set NOCOW: Btrfs doesn't guarantee the effectiveness of
> setting NOCOW (chattr +C) unless the attribute is set while the file is
> still zero size, effectively, at file creation. The easiest way to do
> that is to set NOCOW on the subdir that will contain the file, such that
> when the file is created it inherits the NOCOW attribute automatically.
>
> [2] File sizes up to 128 MiB ... and possibly up to 1 GiB. Under 128 MiB
> should be fine, over 1 GiB is known to cause issues, between the two is a
> gray area that depends on the speed of the hardware and the incoming
> write-stream.
>
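Footnote [1] above can be turned into a small setup step. The Python sketch below creates the directory first and sets the attribute with chattr +C, so that files created in it afterwards inherit NOCOW. The helper names are hypothetical, it needs sufficient privileges, and it does nothing for journal files that already exist, which per the footnote are not reliably converted.

```python
import os
import subprocess


def nocow_command(directory):
    """Build the chattr invocation that sets NOCOW (+C) on a directory."""
    return ["chattr", "+C", directory]


def make_nocow_dir(directory):
    """Create the directory if needed and mark it NOCOW.

    Only files created in it after this call inherit the attribute.
    """
    os.makedirs(directory, exist_ok=True)
    subprocess.run(nocow_command(directory), check=True)
```

As noted earlier in this mail, files created this way also lose btrfs checksums, so this is a trade-off and not a free fix.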
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5