[systemd-devel] Slow startup of systemd-journal on BTRFS

Sat Jun 14 00:52:39 PDT 2014

On 06/14/2014 04:53 AM, Duncan wrote:
> Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
> excerpted:
> 
>> On 06/13/2014 01:24 AM, Dave Chinner wrote:
>>> On Thu, Jun 12, 2014 at 12:37:13PM +0000, Duncan wrote:
>>>>
>>>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>>>> actually pretty much equally bad without NOCOW set on the file.
>>>
>>> So maybe it's been fixed in systemd since the last time I looked.
>>> Yup:
>>>
>>> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>>>
>>> The reason it was changed? To "save a syscall per append", not to
>>> prevent fragmentation of the file, which was the problem everyone was
>>> complaining about...
>>
>> thanks for pointing that. However I am performing my tests on a fedora
>> 20 with systemd-208, which seems have this change
>>>
>>>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
>>>> either 4 byte or 8 MiB file allocations is going to end up being the
>>>> same, forcing (repeated until full) rewrite of each 4 KiB block into
>>>> its own extent.
>>
>> I am reaching the conclusion that fallocate is not the problem. The
>> fallocate increase the filesize of about 8MB, which is enough for some
>> logging. So it is not called very often.
> 
> But... 
> 
> If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
> nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
> MiB and write that out.  So far so good as at that point the 8 MiB should 
> be a single extent.  But then, data gets written into 4 KiB blocks of 
> that 8 MiB one at a time, and because btrfs is COW, the new data in the 
> block must be written to a new location.
> 
> Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> block has been rewritten to a new location and is now an extent unto 
> itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> single 4 KiB block in size.

Several people have pointed to fallocate as the problem, but I don't understand the reason.
1) 8MB is quite a large value, so fallocate is called (at worst) once during boot, and often never, because the logs are smaller than 8MB.
2) It is true that with fallocate btrfs "rewrites" each 4KB page almost twice. But the first time is one "big" 8MB write, and the second write would happen in any case. What I mean is that even without the fallocate, journald would still make small writes.

To be honest, I struggle to see the gain of having a fallocate on a COW filesystem... maybe I don't understand the fallocate() call very well.
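
To make the numbers in this exchange concrete, here is a minimal sketch, not journald's actual code. It preallocates 8 MiB with posix_fallocate() and computes the worst case Duncan describes, where every 4 KiB block of the preallocated range is later rewritten into its own extent. The file name demo.journal and both function names are made up for illustration; the 8 MiB and 4 KiB figures come from the thread.

```python
import os
import tempfile

PREALLOC_BYTES = 8 * 1024 * 1024   # 8 MiB, the step size discussed in the thread
BLOCK_BYTES = 4 * 1024             # 4 KiB btrfs data block


def preallocate(path, nbytes=PREALLOC_BYTES):
    """Reserve nbytes for the file at path and return the resulting file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask the filesystem to allocate the byte range [0, nbytes).
        os.posix_fallocate(fd, 0, nbytes)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


def worst_case_extents(prealloc_bytes=PREALLOC_BYTES, block_bytes=BLOCK_BYTES):
    """Extent count if every block is rewritten to its own new location (COW)."""
    return prealloc_bytes // block_bytes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        size = preallocate(os.path.join(tmp, "demo.journal"))  # hypothetical name
        print("file size after fallocate:", size)
    print("worst-case extents:", worst_case_extents())
```

The extent count is only an upper bound for this simple model; the real figure depends on the allocator and on how writes are batched.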

> 
> =:^(
> 
> Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
> btrfs autodefrag mount option should at least catch that and rewrite 
> (again), this time sequentially.
> 
>> I have to investigate more what happens when the log are copied from
>> /run to /var/log/journal: this is when journald seems to slow all.
> 
> That's an interesting point.
> 
> At least in theory, during normal operation journald will write to 
> /var/log/journal, but there's a point during boot at which it flushes the 
> information accumulated during boot from the volatile /run location to 
> the non-volatile /var/log location.  /That/ write, at least, should be 
> sequential, since there will be > 4 KiB of journal accumulated that needs 
> to be transferred at once.  However, if it's being handled by the forced 
> pre-write fallocate described above, then that's not going to be the 
> case, as it'll then be a rewrite of already fallocated file blocks and 
> thus will get COWed exactly as I described above.
> 
> =:^(
> 
> 
>> I am prepared a PC which reboot continuously; I am collecting the time
>> required to finish the boot vs the fragmentation of the system.journal
>> file vs the number of boot. The results are dramatic: after 20 reboot,
>> the boot time increase of 20-30 seconds. Doing a defrag of
>> system.journal reduces the boot time to the original one, but after
>> another 20 reboot, the boot time still requires 20-30 seconds more....
>>
>> It is a slow PC, but I saw the same behavior also on a more modern pc
>> (i5 with 8GB).
>>
>> For both PC the HD is a mechanical one...
> 
> The problem's duplicable.  That's the first step toward a fix. =:^)
I hope so.
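
For anyone who wants to reproduce the measurement, the fragmentation figure can be collected with filefrag (from e2fsprogs). The sketch below is only one possible way to do it and is not the script I used. It assumes filefrag is installed and that its summary line has the form "<file>: N extents found"; the function names are made up.

```python
import re
import subprocess


def parse_extent_count(filefrag_output):
    """Extract N from a filefrag summary line such as 'f: 12 extents found'."""
    match = re.search(r"(\d+) extents? found", filefrag_output)
    if match is None:
        raise ValueError("no extent count in filefrag output")
    return int(match.group(1))


def count_extents(path):
    """Run filefrag on path and return the number of extents it reports."""
    result = subprocess.run(
        ["filefrag", path], capture_output=True, text=True, check=True
    )
    return parse_extent_count(result.stdout)
```

Calling count_extents() on the system.journal file after each boot gives the fragmentation figure to plot against the boot time.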
> 
>>> And that's now a btrfs problem.... :/
>>
>> Are you sure ?
> 
> As they say, "Whoosh!"
> 
[...] 
> Another alternative is that distros will start setting /var/log/journal 
> NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
> the problem.  (Altho if they do automated snapshotting they'll also have 
> to set it as its own subvolume, to avoid the first-write-after-snapshot-
> is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
> options.

Note that this also removes the checksums, which are very useful in a RAID configuration.

> 
> Meanwhile, there's some focus on making btrfs behave better with such 
> rewrite-pattern files, but while I think the problem can be made /some/ 
> better, hopefully enough that the defaults bother far fewer people in far 
> fewer cases, I expect it'll always be a bit of a sore spot because that's 
> just how the technology works, and as such, setting NOCOW for such files 
> and/or using autodefrag will continue to be recommended for an optimized 
> setup.
> 
> ---
> [1] "Properly" set NOCOW:  Btrfs doesn't guarantee the effectiveness of 
> setting NOCOW (chattr +C) unless the attribute is set while the file is 
> still zero size, effectively, at file creation.  The easiest way to do 
> that is to set NOCOW on the subdir that will contain the file, such that 
> when the file is created it inherits the NOCOW attribute automatically.
> 
> [2] File sizes upto 128 MiB ... and possibly upto 1 GiB.  Under 128 MiB 
> should be fine, over 1 GiB is known to cause issues, between the two is a 
> gray area that depends on the speed of the hardware and the incoming 
> write-stream.
> 
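
Regarding footnote [1], here is a rough sketch of what a setup script could do: set NOCOW on the directory so that new journal files inherit it. It assumes chattr is available and is run with enough privileges; the function names are made up. As noted above, NOCOW also disables the checksums for those files.

```python
import os
import subprocess


def nocow_is_reliable(path):
    """Per footnote [1]: +C is only dependable on a directory or a zero-size file."""
    return os.path.isdir(path) or os.path.getsize(path) == 0


def nocow_command(path):
    """Build the chattr invocation that sets the NOCOW (C) attribute."""
    return ["chattr", "+C", path]


def set_nocow(path):
    """Set NOCOW on path, refusing files that already contain data."""
    if not nocow_is_reliable(path):
        raise ValueError("file already has data; +C may not take effect")
    subprocess.run(nocow_command(path), check=True)
```

For example, set_nocow("/var/log/journal") before journald creates any file there; existing journal files keep their COW behaviour.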

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5